範例檔資料下載

dplyr 介紹:


dplyr 函數包是處理矩形資料(dataframetibble 格式)的利器, 函數大多以 function( 物標物件, 變數 ) 的格式做表達,形式簡潔可讀性高, 且函數間可透過 %>% (pipe) 運算符號(Ctrl + Shift + M) 組合,進而輕鬆實現彈性、高效率的資料處理。

安裝並載入函數包:

tidyverse中包含了dplyrggplot2,與stringr等數據處理常用包,直接呼叫即可,

細節可參考 官方網站

範例資料介紹:

  1. stockdata.csv : 2018-01-012018-12-31 部分台股上市公司代碼、產業別、每日調整後開高收低價格。 (資料來源TEJ+台灣經濟新報)

  2. eps.csv : 2018年台股上市公司全年累計每股盈餘(eps)。 (資料來源 TEJ+台灣經濟新報)

## [1] "tbl_df"     "tbl"        "data.frame"
## [1] "code"  "ind"   "date"  "open"  "high"  "low"   "close"
## [1] 28478     7
code ind date open high low close
1101 1 20180102 27.68 27.79 27.37 27.75
1102 1 20180102 25.90 26.00 25.67 25.81
1103 1 20180102 12.12 12.21 11.94 12.21
1104 1 20180102 20.66 20.70 20.57 20.66
1108 1 20180102 8.35 8.40 8.35 8.40
1109 1 20180102 12.68 13.17 12.68 13.02
1110 1 20180102 14.56 14.56 14.31 14.46
1201 2 20180102 20.65 20.65 20.35 20.50
1203 2 20180102 23.97 24.21 23.97 24.07
1210 2 20180102 27.46 27.54 27.33 27.42



dplyr常用函數介紹:


Arrange

將資料照所選欄位排序

## # A tibble: 28,478 x 7
##     code   ind     date  open  high   low close
##    <int> <int>    <int> <dbl> <dbl> <dbl> <dbl>
##  1  1101     1 20181228  30.6  30.9  30.6  30.8
##  2  1101     1 20181227  30.9  30.9  30.6  30.6
##  3  1101     1 20181226  30.9  31.0  30.6  30.6
##  4  1101     1 20181225  30.3  31.0  30.2  31.0
##  5  1101     1 20181224  30.8  31.1  30.7  30.9
##  6  1101     1 20181222  30.6  30.8  30.5  30.6
##  7  1101     1 20181221  30.6  30.8  30.5  30.6
##  8  1101     1 20181220  30.9  31.2  30.5  30.6
##  9  1101     1 20181219  30.9  31.1  30.8  31.1
## 10  1101     1 20181218  31.0  31.3  30.7  30.8
## # ... with 28,468 more rows

Select

選擇資料中所需欄位

## # A tibble: 28,478 x 3
##    close  code   ind
##    <dbl> <int> <int>
##  1  27.8  1101     1
##  2  25.8  1102     1
##  3  12.2  1103     1
##  4  20.7  1104     1
##  5   8.4  1108     1
##  6  13.0  1109     1
##  7  14.5  1110     1
##  8  20.5  1201     2
##  9  24.1  1203     2
## 10  27.4  1210     2
## # ... with 28,468 more rows
## [1] "code" "ind"  "date" "open"
## # A tibble: 28,478 x 4
##     code   ind     date  open
##    <int> <int>    <int> <dbl>
##  1  1101     1 20180102 27.7 
##  2  1102     1 20180102 25.9 
##  3  1103     1 20180102 12.1 
##  4  1104     1 20180102 20.7 
##  5  1108     1 20180102  8.35
##  6  1109     1 20180102 12.7 
##  7  1110     1 20180102 14.6 
##  8  1201     2 20180102 20.6 
##  9  1203     2 20180102 24.0 
## 10  1210     2 20180102 27.5 
## # ... with 28,468 more rows
## # A tibble: 28,478 x 6
##     code     date  open  high   low close
##    <int>    <int> <dbl> <dbl> <dbl> <dbl>
##  1  1101 20180102 27.7   27.8 27.4   27.8
##  2  1102 20180102 25.9   26   25.7   25.8
##  3  1103 20180102 12.1   12.2 11.9   12.2
##  4  1104 20180102 20.7   20.7 20.6   20.7
##  5  1108 20180102  8.35   8.4  8.35   8.4
##  6  1109 20180102 12.7   13.2 12.7   13.0
##  7  1110 20180102 14.6   14.6 14.3   14.5
##  8  1201 20180102 20.6   20.6 20.4   20.5
##  9  1203 20180102 24.0   24.2 24.0   24.1
## 10  1210 20180102 27.5   27.5 27.3   27.4
## # ... with 28,468 more rows

Filter

依照條件篩選資料

## # A tibble: 48 x 7
##     code   ind     date  open  high   low close
##    <int> <int>    <int> <dbl> <dbl> <dbl> <dbl>
##  1  1590     5 20180102  523.  526.  501.  511.
##  2  1590     5 20180103  516.  520.  511.  515.
##  3  1590     5 20180104  518.  522.  513.  520.
##  4  1590     5 20180105  525.  525.  509.  512.
##  5  1590     5 20180108  515.  515.  498.  501.
##  6  1590     5 20180226  506.  512.  503.  510.
##  7  1590     5 20180227  518.  529.  518.  522.
##  8  1590     5 20180301  522.  533.  513.  516.
##  9  1590     5 20180302  508.  517.  508.  513.
## 10  1590     5 20180329  488.  507.  484.  507.
## # ... with 38 more rows
## # A tibble: 163 x 7
##     code   ind     date  open  high   low close
##    <int> <int>    <int> <dbl> <dbl> <dbl> <dbl>
##  1  1101     1 20180424  30.0  31.1  30.0  31.0
##  2  1101     1 20180425  30.9  31.1  30.3  30.5
##  3  1101     1 20180426  30.6  31.1  30.6  30.8
##  4  1101     1 20180427  31.1  31.1  30.4  30.6
##  5  1101     1 20180430  30.9  31.0  30.5  31.0
##  6  1101     1 20180502  31.0  31.4  30.9  31.0
##  7  1101     1 20180504  30.1  30.4  30.0  30.0
##  8  1101     1 20180507  30.4  31.4  30.4  30.7
##  9  1101     1 20180508  30.7  31.3  30.4  31.0
## 10  1101     1 20180509  30.9  30.9  30.7  30.8
## # ... with 153 more rows
## # A tibble: 741 x 7
##     code   ind     date  open  high   low close
##    <int> <int>    <int> <dbl> <dbl> <dbl> <dbl>
##  1  1101     1 20180102  27.7  27.8  27.4  27.8
##  2  1102     1 20180102  25.9  26    25.7  25.8
##  3  1103     1 20180102  12.1  12.2  11.9  12.2
##  4  1101     1 20180103  27.7  28.4  27.5  28.4
##  5  1102     1 20180103  25.9  26    25.7  26  
##  6  1103     1 20180103  12.2  12.2  12.0  12.1
##  7  1101     1 20180104  28.4  28.4  27.9  28.2
##  8  1102     1 20180104  26.1  26.4  26    26.2
##  9  1103     1 20180104  12.1  12.5  12.1  12.1
## 10  1101     1 20180105  28.5  28.5  28.1  28.3
## # ... with 731 more rows

Mutate

將現有資料欄位經過計算後,形成新欄位

## # A tibble: 28,478 x 9
##     code   ind     date  open  high   low close maxDif log_maxDif
##    <int> <int>    <int> <dbl> <dbl> <dbl> <dbl>  <dbl>      <dbl>
##  1  1101     1 20180102 27.7   27.8 27.4   27.8  0.420     -0.868
##  2  1102     1 20180102 25.9   26   25.7   25.8  0.330     -1.11 
##  3  1103     1 20180102 12.1   12.2 11.9   12.2  0.27      -1.31 
##  4  1104     1 20180102 20.7   20.7 20.6   20.7  0.130     -2.04 
##  5  1108     1 20180102  8.35   8.4  8.35   8.4  0.05      -3.00 
##  6  1109     1 20180102 12.7   13.2 12.7   13.0  0.49      -0.713
##  7  1110     1 20180102 14.6   14.6 14.3   14.5  0.25      -1.39 
##  8  1201     2 20180102 20.6   20.6 20.4   20.5  0.300     -1.20 
##  9  1203     2 20180102 24.0   24.2 24.0   24.1  0.24      -1.43 
## 10  1210     2 20180102 27.5   27.5 27.3   27.4  0.21      -1.56 
## # ... with 28,468 more rows
## # A tibble: 28,478 x 8
##     code   ind     date  open  high   low close     n
##    <int> <int>    <int> <dbl> <dbl> <dbl> <dbl> <int>
##  1  1101     1 20180102 27.7   27.8 27.4   27.8     1
##  2  1102     1 20180102 25.9   26   25.7   25.8     2
##  3  1103     1 20180102 12.1   12.2 11.9   12.2     3
##  4  1104     1 20180102 20.7   20.7 20.6   20.7     4
##  5  1108     1 20180102  8.35   8.4  8.35   8.4     5
##  6  1109     1 20180102 12.7   13.2 12.7   13.0     6
##  7  1110     1 20180102 14.6   14.6 14.3   14.5     7
##  8  1201     2 20180102 20.6   20.6 20.4   20.5     8
##  9  1203     2 20180102 24.0   24.2 24.0   24.1     9
## 10  1210     2 20180102 27.5   27.5 27.3   27.4    10
## # ... with 28,468 more rows
## # A tibble: 28,478 x 8
##     code   ind     date  open  high   low close lClose
##    <int> <int>    <int> <dbl> <dbl> <dbl> <dbl>  <dbl>
##  1  1101     1 20180102 27.7   27.8 27.4   27.8   3.32
##  2  1102     1 20180102 25.9   26   25.7   25.8   3.25
##  3  1103     1 20180102 12.1   12.2 11.9   12.2   2.50
##  4  1104     1 20180102 20.7   20.7 20.6   20.7   3.03
##  5  1108     1 20180102  8.35   8.4  8.35   8.4   2.13
##  6  1109     1 20180102 12.7   13.2 12.7   13.0   2.57
##  7  1110     1 20180102 14.6   14.6 14.3   14.5   2.67
##  8  1201     2 20180102 20.6   20.6 20.4   20.5   3.02
##  9  1203     2 20180102 24.0   24.2 24.0   24.1   3.18
## 10  1210     2 20180102 27.5   27.5 27.3   27.4   3.31
## # ... with 28,468 more rows

Group_By

將資料依照特定欄位內容分組

## # A tibble: 1 x 7
##    code   ind     date  open  high   low close
##   <int> <int>    <int> <dbl> <dbl> <dbl> <dbl>
## 1  1417     4 20181030  4.23  4.29  4.19  4.26
## # A tibble: 6 x 7
## # Groups:   ind [4]
##    code   ind     date  open  high   low close
##   <int> <int>    <int> <dbl> <dbl> <dbl> <dbl>
## 1  1108     1 20181026  7.15  7.16  6.99  6.99
## 2  1217     2 20181030  6.61  6.68  6.55  6.58
## 3  1417     4 20181030  4.23  4.29  4.19  4.26
## 4  1538     5 20181102  4.7   4.7   4.7   4.7 
## 5  1538     5 20181105  4.7   4.7   4.7   4.7 
## 6  1538     5 20181106  4.7   4.7   4.7   4.7

Summarise

將資料做總結,常搭配group_by()使用

## # A tibble: 4 x 2
##     ind sum_close
##   <int>     <dbl>
## 1     1      18.6
## 2     2      39.4
## 3     4      27.7
## 4     5      55.0



其他常用函數


lag

## # A tibble: 28,478 x 8
## # Groups:   code [116]
##     code   ind     date  open  high   low close      ret
##    <int> <int>    <int> <dbl> <dbl> <dbl> <dbl>    <dbl>
##  1  1101     1 20180102  27.7  27.8  27.4  27.8 NA      
##  2  1101     1 20180103  27.7  28.4  27.5  28.4  0.0220 
##  3  1101     1 20180104  28.4  28.4  27.9  28.2 -0.00388
##  4  1101     1 20180105  28.5  28.5  28.1  28.3  0.00248
##  5  1101     1 20180108  28.4  29.1  28.2  29.0  0.0244 
##  6  1101     1 20180109  28.9  29.3  28.8  29.1  0.00241
##  7  1101     1 20180110  29.1  29.2  28.9  29.2  0.00413
##  8  1101     1 20180111  29.1  29.1  28.4  28.5 -0.0247 
##  9  1101     1 20180112  28.7  28.8  28.4  28.6  0.00246
## 10  1101     1 20180115  28.6  28.7  28.5  28.7  0.00665
## # ... with 28,468 more rows

distinct

## # A tibble: 4 x 1
##     ind
##   <int>
## 1     1
## 2     2
## 3     4
## 4     5

left_join

## # A tibble: 28,478 x 8
##     code   ind     date  open  high   low close   eps
##    <int> <int>    <int> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  1101     1 20180102 27.7   27.8 27.4   27.8  2.03
##  2  1102     1 20180102 25.9   26   25.7   25.8  1.74
##  3  1103     1 20180102 12.1   12.2 11.9   12.2  1.39
##  4  1104     1 20180102 20.7   20.7 20.6   20.7  2.16
##  5  1108     1 20180102  8.35   8.4  8.35   8.4 -0.25
##  6  1109     1 20180102 12.7   13.2 12.7   13.0  1.32
##  7  1110     1 20180102 14.6   14.6 14.3   14.5  0.11
##  8  1201     2 20180102 20.6   20.6 20.4   20.5  1.31
##  9  1203     2 20180102 24.0   24.2 24.0   24.1  1.86
## 10  1210     2 20180102 27.5   27.5 27.3   27.4  3.28
## # ... with 28,468 more rows

更多join系列動圖看 這裡

實作範例:


情境:

希望以股價、EPS資料做分析,研究市場是否會給予產業內獲利的公司股價正面反應?

又不同產業間,狀況是否有差異?

請使用本範例提供的stockdata.csveps.csv進行操作,

獲利的定義為eps大於10,正面反應的定義為全年累計報酬為正。

步驟如下:

  1. 請將eps.csv合併至stockdata.csv並篩選出eps大於10者。(left_join, filter)

  2. 請將個股分群後依照代碼、日期排序並計算單日簡單報酬、全年累積報酬。(gorup_by, arrange, mutate, lag, cumprod)

  3. 計算各產業平均全年累計報酬。(group_by, summarize)