Data manipulation with tidyverse in R

This is a draft. The text is still incomplete and will be filled in later.

0. Loading Tidyverse

tidyverse is a collection of packages that work well together for importing, manipulating, analysing, and visualising data. We will mostly use the dplyr and ggplot2 packages, but other packages in tidyverse are also useful for our work, so we load the whole tidyverse package in one go instead of loading each package separately!


library(tidyverse)

## ── Attaching packages ─────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.2.0     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.3.0

## Warning: package 'ggplot2' was built under R version 3.5.2

## Warning: package 'tibble' was built under R version 3.5.2

## Warning: package 'tidyr' was built under R version 3.5.2

## Warning: package 'purrr' was built under R version 3.5.2

## Warning: package 'dplyr' was built under R version 3.5.2

## Warning: package 'stringr' was built under R version 3.5.2

## ── Conflicts ────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

The output above lists the packages that were loaded and, under Conflicts, the base functions that dplyr now masks. It is not an error message, so there is no need to worry.
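The Conflicts lines mean that a bare filter() now refers to dplyr's version. A minimal sketch of how both versions can still be reached, using a small made-up table (toy is a hypothetical name, not the soybean data):

```r
library(dplyr)

# A tiny made-up table, only for illustration.
toy <- tibble(year = c(1970, 1970, 1971), yield = c(2.4, 2.3, 2.8))

# After library(dplyr), a bare filter() call is dplyr's row filter.
rows_1971 <- filter(toy, year == 1971)

# Writing the package prefix makes the intended function explicit.
same_rows <- dplyr::filter(toy, year == 1971)

# The masked function is still available through its own prefix.
# stats::filter() applies a linear filter to a numeric series
# (here a two-point moving average); it does not pick rows.
smoothed <- stats::filter(c(1, 2, 3, 4), rep(1 / 2, 2))
```

The prefix form is mainly useful in scripts where it is unclear which packages have been loaded.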

1. Importing Data

R can import data in many formats, from .txt to .xlsx. For most agricultural and biological research, however, the most stable and least troublesome format is comma-separated values (CSV). An Excel file can be saved as CSV by using Save As and choosing the .csv format.

We will use the example file australian_soybean.csv to import data into R.

If you have used R before, you may have used read.csv. That function has some surprising behaviour (for example, before R 4.0 it converted text columns to factors by default), so we will use read_csv() from the readr package instead. It returns a tibble rather than the data.frame you may be used to.

aus.soy <- read_csv("australian_soybean.csv")

## Parsed with column specification:
## cols(
##   env = col_character(),
##   loc = col_character(),
##   year = col_double(),
##   gen = col_character(),
##   yield = col_double(),
##   height = col_double(),
##   lodging = col_double(),
##   size = col_double(),
##   protein = col_double(),
##   oil = col_double()
## )

Running the command prints several lines of text beginning with "Parsed with…". Don't worry, this is not an error; it only reports how the command interpreted each column of our data.
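The column specification can also be written out by hand and passed to read_csv() through its col_types argument, in which case the types are fixed up front rather than guessed. The sketch below writes a tiny made-up CSV to a temporary file so that it runs without australian_soybean.csv; the file contents and the name toy are assumptions for illustration only.

```r
library(readr)

# Write a small made-up CSV to a temporary file (not the real data set).
path <- tempfile(fileext = ".csv")
writeLines(c("env,year,yield", "L70,1970,2.39", "L71,1971,2.79"), path)

# Declare the type of every column instead of letting readr guess.
toy <- read_csv(path, col_types = cols(
  env = col_character(),
  year = col_double(),
  yield = col_double()
))
toy
```

Declaring the types this way can be handy when a column that should be text, such as a genotype code, happens to look numeric.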

2. Preview the data and tibble format

We have assigned the data to a new object called aus.soy. Let's see what the data look like.

aus.soy

## # A tibble: 464 x 10
##    env   loc    year gen   yield height lodging  size protein   oil
##    <chr> <chr> <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
##  1 L70   Lawes  1970 G01    2.39  1.44     4.25  8.45    36.7  20.9
##  2 L70   Lawes  1970 G02    2.28  1.45     4.25  9.95    37.6  20.7
##  3 L70   Lawes  1970 G03    2.57  1.46     3.75 10.8     37.8  21.3
##  4 L70   Lawes  1970 G04    2.88  1.26     3.5  10.0     38.4  22.0
##  5 L70   Lawes  1970 G05    2.39  1.34     3.5  11       37.5  22.1
##  6 L70   Lawes  1970 G06    2.41  1.36     4    11.8     38.2  21.2
##  7 L70   Lawes  1970 G07    2.70  1.3      3    11.8     37.4  21.7
##  8 L70   Lawes  1970 G08    2.46  0.955    3.25 10       35.2  21.1
##  9 L70   Lawes  1970 G09    2.57  1.03     3    11.2     35.9  21.5
## 10 L70   Lawes  1970 G10    2.98  1.16     3.75 10.8     39.7  20.4
## # … with 454 more rows

The data are shown in what is called the tibble data format, which prints only the first 10 rows and as many columns as fit the width of the screen. The top line reads:

# A tibble: 464 x 10

This tells us that the data have 464 rows (observations) and 10 columns (variables). We can also get the size of the table with the dim command (short for "dimension").

dim(aus.soy)

## [1] 464  10
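Besides dim(), a few related functions report the same information piece by piece. A short sketch on a made-up three-row table (toy is a hypothetical stand-in for aus.soy):

```r
library(dplyr)

# Made-up stand-in for aus.soy, small enough to check by eye.
toy <- tibble(
  env = c("L70", "L70", "B70"),
  year = c(1970, 1970, 1970),
  yield = c(2.39, 2.28, 1.25)
)

n_rows <- nrow(toy)     # number of rows only
n_cols <- ncol(toy)     # number of columns only
col_names <- names(toy) # column names as a character vector

# glimpse() prints one line per column with its type and first values.
glimpse(toy)
```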

To see more than 10 rows, pass the number of rows to display (n) to print().

print(aus.soy, n = 20)

## # A tibble: 464 x 10
##    env   loc    year gen   yield height lodging  size protein   oil
##    <chr> <chr> <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
##  1 L70   Lawes  1970 G01   2.39   1.44     4.25  8.45    36.7  20.9
##  2 L70   Lawes  1970 G02   2.28   1.45     4.25  9.95    37.6  20.7
##  3 L70   Lawes  1970 G03   2.57   1.46     3.75 10.8     37.8  21.3
##  4 L70   Lawes  1970 G04   2.88   1.26     3.5  10.0     38.4  22.0
##  5 L70   Lawes  1970 G05   2.39   1.34     3.5  11       37.5  22.1
##  6 L70   Lawes  1970 G06   2.41   1.36     4    11.8     38.2  21.2
##  7 L70   Lawes  1970 G07   2.70   1.3      3    11.8     37.4  21.7
##  8 L70   Lawes  1970 G08   2.46   0.955    3.25 10       35.2  21.1
##  9 L70   Lawes  1970 G09   2.57   1.03     3    11.2     35.9  21.5
## 10 L70   Lawes  1970 G10   2.98   1.16     3.75 10.8     39.7  20.4
## 11 L70   Lawes  1970 G11   1.66   1.42     4.5   6.95    40.2  19.1
## 12 L70   Lawes  1970 G12   1.96   1.44     4.25  8.35    40.3  18.7
## 13 L70   Lawes  1970 G13   1.47   1.58     4.5   9.3     41.2  19.2
## 14 L70   Lawes  1970 G14   2.72   1.33     4     8.25    37.4  20.8
## 15 L70   Lawes  1970 G15   2.22   1.37     4.25  9.3     36.6  20.7
## 16 L70   Lawes  1970 G16   1.66   1.7      4.75  9.15    39.6  20.4
## 17 L70   Lawes  1970 G17   1.72   1.28     4.25  8.4     43.7  17.5
## 18 L70   Lawes  1970 G18   1.43   1.50     4.5   7.8     42.4  17.4
## 19 L70   Lawes  1970 G19   1.43   1.37     4.25  8.5     41.0  19.4
## 20 L70   Lawes  1970 G20   0.998  1.32     4.5   8.95    41.1  18.2
## # … with 444 more rows

Alternatively, go to the Environment tab in RStudio and double-click aus.soy; a new tab opens showing the data as a table.

Under each column name, the tibble also shows an abbreviation. These abbreviations give the type of the variable in each column, for example:

dbl stands for double, used for real numbers
chr stands for character, used for text

The data are yield and other traits of 58 soybean genotypes (gen) grown at 4 locations in Queensland, Australia, in 1970 and 1971. They come from this website: http://three-mode.leidenuniv.nl/data/soybeaninf.htm

3. Subsetting data by number

We can view part of a large table by using square brackets [ ].

aus.soy[1, ]

## # A tibble: 1 x 10
##   env   loc    year gen   yield height lodging  size protein   oil
##   <chr> <chr> <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
## 1 L70   Lawes  1970 G01    2.39   1.44    4.25  8.45    36.7  20.9

The number before the , is the row number and the number after it is the column number. The code above returns every column of row 1, so the column part is left empty.

We can also use a range, or pick out specific rows.

aus.soy[1:5, ]

## # A tibble: 5 x 10
##   env   loc    year gen   yield height lodging  size protein   oil
##   <chr> <chr> <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
## 1 L70   Lawes  1970 G01    2.39   1.44    4.25  8.45    36.7  20.9
## 2 L70   Lawes  1970 G02    2.28   1.45    4.25  9.95    37.6  20.7
## 3 L70   Lawes  1970 G03    2.57   1.46    3.75 10.8     37.8  21.3
## 4 L70   Lawes  1970 G04    2.88   1.26    3.5  10.0     38.4  22.0
## 5 L70   Lawes  1970 G05    2.39   1.34    3.5  11       37.5  22.1

aus.soy[c(1,3,5), ]

## # A tibble: 3 x 10
##   env   loc    year gen   yield height lodging  size protein   oil
##   <chr> <chr> <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
## 1 L70   Lawes  1970 G01    2.39   1.44    4.25  8.45    36.7  20.9
## 2 L70   Lawes  1970 G03    2.57   1.46    3.75 10.8     37.8  21.3
## 3 L70   Lawes  1970 G05    2.39   1.34    3.5  11       37.5  22.1
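Ranges and individual row numbers can be mixed inside a single c() call. A minimal sketch on a made-up 30-row table (toy and its columns are hypothetical):

```r
library(dplyr)

# Made-up table with 30 rows; id simply records the original row number.
toy <- tibble(id = 1:30, value = (1:30) * 2)

# Two separate ranges combined into one row index.
picked <- toy[c(1:3, 10:12), ]

# 3 rows from the first range plus 3 from the second.
nrow(picked)
picked$id
```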

Selecting columns works the same way, except that the column numbers or names go after the , . Alternatively, use $ followed by the column name.

aus.soy[,1]

## # A tibble: 464 x 1
##    env  
##    <chr>
##  1 L70  
##  2 L70  
##  3 L70  
##  4 L70  
##  5 L70  
##  6 L70  
##  7 L70  
##  8 L70  
##  9 L70  
## 10 L70  
## # … with 454 more rows

aus.soy[,c(1,3,5)]

## # A tibble: 464 x 3
##    env    year yield
##    <chr> <dbl> <dbl>
##  1 L70    1970  2.39
##  2 L70    1970  2.28
##  3 L70    1970  2.57
##  4 L70    1970  2.88
##  5 L70    1970  2.39
##  6 L70    1970  2.41
##  7 L70    1970  2.70
##  8 L70    1970  2.46
##  9 L70    1970  2.57
## 10 L70    1970  2.98
## # … with 454 more rows

aus.soy[,c("loc","year")]

## # A tibble: 464 x 2
##    loc    year
##    <chr> <dbl>
##  1 Lawes  1970
##  2 Lawes  1970
##  3 Lawes  1970
##  4 Lawes  1970
##  5 Lawes  1970
##  6 Lawes  1970
##  7 Lawes  1970
##  8 Lawes  1970
##  9 Lawes  1970
## 10 Lawes  1970
## # … with 454 more rows

aus.soy$env

##   [1] "L70" "L70" "L70" "L70" "L70" "L70" "L70" "L70" "L70" "L70" "L70"
##  [12] "L70" "L70" "L70" "L70" "L70" "L70" "L70" "L70" "L70" "L70" "L70"
##  [23] "L70" "L70" "L70" "L70" "L70" "L70" "L70" "L70" "L70" "L70" "L70"
##  [34] "L70" "L70" "L70" "L70" "L70" "L70" "L70" "L70" "L70" "L70" "L70"
##  [45] "L70" "L70" "L70" "L70" "L70" "L70" "L70" "L70" "L70" "L70" "L70"
##  [56] "L70" "L70" "L70" "B70" "B70" "B70" "B70" "B70" "B70" "B70" "B70"
##  [67] "B70" "B70" "B70" "B70" "B70" "B70" "B70" "B70" "B70" "B70" "B70"
##  [78] "B70" "B70" "B70" "B70" "B70" "B70" "B70" "B70" "B70" "B70" "B70"
##  [89] "B70" "B70" "B70" "B70" "B70" "B70" "B70" "B70" "B70" "B70" "B70"
## [100] "B70" "B70" "B70" "B70" "B70" "B70" "B70" "B70" "B70" "B70" "B70"
## [111] "B70" "B70" "B70" "B70" "B70" "B70" "N70" "N70" "N70" "N70" "N70"
## [122] "N70" "N70" "N70" "N70" "N70" "N70" "N70" "N70" "N70" "N70" "N70"
## [133] "N70" "N70" "N70" "N70" "N70" "N70" "N70" "N70" "N70" "N70" "N70"
## [144] "N70" "N70" "N70" "N70" "N70" "N70" "N70" "N70" "N70" "N70" "N70"
## [155] "N70" "N70" "N70" "N70" "N70" "N70" "N70" "N70" "N70" "N70" "N70"
## [166] "N70" "N70" "N70" "N70" "N70" "N70" "N70" "N70" "N70" "R70" "R70"
## [177] "R70" "R70" "R70" "R70" "R70" "R70" "R70" "R70" "R70" "R70" "R70"
## [188] "R70" "R70" "R70" "R70" "R70" "R70" "R70" "R70" "R70" "R70" "R70"
## [199] "R70" "R70" "R70" "R70" "R70" "R70" "R70" "R70" "R70" "R70" "R70"
## [210] "R70" "R70" "R70" "R70" "R70" "R70" "R70" "R70" "R70" "R70" "R70"
## [221] "R70" "R70" "R70" "R70" "R70" "R70" "R70" "R70" "R70" "R70" "R70"
## [232] "R70" "L71" "L71" "L71" "L71" "L71" "L71" "L71" "L71" "L71" "L71"
## [243] "L71" "L71" "L71" "L71" "L71" "L71" "L71" "L71" "L71" "L71" "L71"
## [254] "L71" "L71" "L71" "L71" "L71" "L71" "L71" "L71" "L71" "L71" "L71"
## [265] "L71" "L71" "L71" "L71" "L71" "L71" "L71" "L71" "L71" "L71" "L71"
## [276] "L71" "L71" "L71" "L71" "L71" "L71" "L71" "L71" "L71" "L71" "L71"
## [287] "L71" "L71" "L71" "L71" "B71" "B71" "B71" "B71" "B71" "B71" "B71"
## [298] "B71" "B71" "B71" "B71" "B71" "B71" "B71" "B71" "B71" "B71" "B71"
## [309] "B71" "B71" "B71" "B71" "B71" "B71" "B71" "B71" "B71" "B71" "B71"
## [320] "B71" "B71" "B71" "B71" "B71" "B71" "B71" "B71" "B71" "B71" "B71"
## [331] "B71" "B71" "B71" "B71" "B71" "B71" "B71" "B71" "B71" "B71" "B71"
## [342] "B71" "B71" "B71" "B71" "B71" "B71" "B71" "N71" "N71" "N71" "N71"
## [353] "N71" "N71" "N71" "N71" "N71" "N71" "N71" "N71" "N71" "N71" "N71"
## [364] "N71" "N71" "N71" "N71" "N71" "N71" "N71" "N71" "N71" "N71" "N71"
## [375] "N71" "N71" "N71" "N71" "N71" "N71" "N71" "N71" "N71" "N71" "N71"
## [386] "N71" "N71" "N71" "N71" "N71" "N71" "N71" "N71" "N71" "N71" "N71"
## [397] "N71" "N71" "N71" "N71" "N71" "N71" "N71" "N71" "N71" "N71" "R71"
## [408] "R71" "R71" "R71" "R71" "R71" "R71" "R71" "R71" "R71" "R71" "R71"
## [419] "R71" "R71" "R71" "R71" "R71" "R71" "R71" "R71" "R71" "R71" "R71"
## [430] "R71" "R71" "R71" "R71" "R71" "R71" "R71" "R71" "R71" "R71" "R71"
## [441] "R71" "R71" "R71" "R71" "R71" "R71" "R71" "R71" "R71" "R71" "R71"
## [452] "R71" "R71" "R71" "R71" "R71" "R71" "R71" "R71" "R71" "R71" "R71"
## [463] "R71" "R71"

Can you tell what each line does?
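One difference in the output above is worth spelling out: [ ] on a tibble returns a tibble, while $ returns a plain vector. The sketch below shows this on a made-up table (toy is a hypothetical name); the comparison with a base data.frame is included only for contrast.

```r
library(dplyr)

# Made-up table for illustration.
toy <- tibble(env = c("L70", "B70", "N70"), yield = c(2.39, 1.25, 2.26))

as_table <- toy[, "yield"]     # still a tibble, with one column
as_vector <- toy$yield         # a plain numeric vector
also_vector <- toy[["yield"]]  # double brackets also return the vector

# Functions such as mean() expect the vector form.
mean(as_vector)

# A base data.frame drops to a vector when one column is selected this way.
base_result <- as.data.frame(toy)[, "yield"]
```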

We can also select rows and columns at the same time.

aus.soy[1:5,1:3]

## # A tibble: 5 x 3
##   env   loc    year
##   <chr> <chr> <dbl>
## 1 L70   Lawes  1970
## 2 L70   Lawes  1970
## 3 L70   Lawes  1970
## 4 L70   Lawes  1970
## 5 L70   Lawes  1970

All the code so far only displays the subset. To use the subset in further work, we must assign it to a new object.

obj1 <- aus.soy[1:5, 5:6]
obj1

## # A tibble: 5 x 2
##   yield height
##   <dbl>  <dbl>
## 1  2.39   1.44
## 2  2.28   1.45
## 3  2.57   1.46
## 4  2.88   1.26
## 5  2.39   1.34

The command above creates a new object called obj1 holding rows 1 to 5 and columns 5 and 6 (yield and height). We can then use this object for other things, such as computing a mean or a standard deviation (sd).

mean(obj1$yield)

## [1] 2.501

sd(obj1$height)

## [1] 0.08867074

Exercise

Subset aus.soy as follows:

Rows 15 to 20 and 100 to 150, with all columns

The answer should be a 57x10 table.

Only the height and lodging columns, with all rows

The answer should be a 464x2 table.

Only the protein and oil columns, with rows 10 to 100

The answer should be a 91x2 table.

4. Data manipulation: Five main verbs

In this section we learn the main verbs of the dplyr package, which make our work more efficient.


4.1 `filter()`

We use filter() to select rows based on the values in a column. For example, to keep only the rows from 1970:

aus.soy.1970 <- filter(aus.soy, year == 1970)
aus.soy.1970

## # A tibble: 232 x 10
##    env   loc    year gen   yield height lodging  size protein   oil
##    <chr> <chr> <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
##  1 L70   Lawes  1970 G01    2.39  1.44     4.25  8.45    36.7  20.9
##  2 L70   Lawes  1970 G02    2.28  1.45     4.25  9.95    37.6  20.7
##  3 L70   Lawes  1970 G03    2.57  1.46     3.75 10.8     37.8  21.3
##  4 L70   Lawes  1970 G04    2.88  1.26     3.5  10.0     38.4  22.0
##  5 L70   Lawes  1970 G05    2.39  1.34     3.5  11       37.5  22.1
##  6 L70   Lawes  1970 G06    2.41  1.36     4    11.8     38.2  21.2
##  7 L70   Lawes  1970 G07    2.70  1.3      3    11.8     37.4  21.7
##  8 L70   Lawes  1970 G08    2.46  0.955    3.25 10       35.2  21.1
##  9 L70   Lawes  1970 G09    2.57  1.03     3    11.2     35.9  21.5
## 10 L70   Lawes  1970 G10    2.98  1.16     3.75 10.8     39.7  20.4
## # … with 222 more rows

Note that the top line reads # A tibble: 232 x 10, which shows that the table now has only 232 observations, all from 1970.

== tells filter() to keep the rows where year is equal to 1970. We cannot use a single = here, because = has a different meaning in R (it assigns a value or names a function argument).

The other operators for building conditions in R are:

!= means not equal to
> means greater than
< means less than
>= means greater than or equal to
<= means less than or equal to
%in% means is an element of the vector that follows
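Each of these operators compares element by element and returns a vector of TRUE and FALSE; filter() keeps the rows where the result is TRUE. A small base R sketch on made-up values (oil here is a hypothetical vector, not the column from aus.soy):

```r
# Made-up oil percentages for three samples.
oil <- c(19.5, 20.0, 21.3)

not_20 <- oil != 20        # TRUE FALSE TRUE
above_20 <- oil > 20       # FALSE FALSE TRUE
below_20 <- oil < 20       # TRUE FALSE FALSE
at_least_20 <- oil >= 20   # FALSE TRUE TRUE
at_most_20 <- oil <= 20    # TRUE TRUE FALSE

# %in% asks, for each element on the left, whether it appears on the right.
in_set <- c("G01", "G05") %in% c("G01", "G02")  # TRUE FALSE
```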

Run each of the following lines and see what happens.

filter(aus.soy, oil > 20)

## # A tibble: 217 x 10
##    env   loc    year gen   yield height lodging  size protein   oil
##    <chr> <chr> <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
##  1 L70   Lawes  1970 G01    2.39  1.44     4.25  8.45    36.7  20.9
##  2 L70   Lawes  1970 G02    2.28  1.45     4.25  9.95    37.6  20.7
##  3 L70   Lawes  1970 G03    2.57  1.46     3.75 10.8     37.8  21.3
##  4 L70   Lawes  1970 G04    2.88  1.26     3.5  10.0     38.4  22.0
##  5 L70   Lawes  1970 G05    2.39  1.34     3.5  11       37.5  22.1
##  6 L70   Lawes  1970 G06    2.41  1.36     4    11.8     38.2  21.2
##  7 L70   Lawes  1970 G07    2.70  1.3      3    11.8     37.4  21.7
##  8 L70   Lawes  1970 G08    2.46  0.955    3.25 10       35.2  21.1
##  9 L70   Lawes  1970 G09    2.57  1.03     3    11.2     35.9  21.5
## 10 L70   Lawes  1970 G10    2.98  1.16     3.75 10.8     39.7  20.4
## # … with 207 more rows

filter(aus.soy, year != 1970)

## # A tibble: 232 x 10
##    env   loc    year gen   yield height lodging  size protein   oil
##    <chr> <chr> <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
##  1 L71   Lawes  1971 G01    2.79  0.97     2.5   8.35    39.4  18.7
##  2 L71   Lawes  1971 G02    2.62  0.785    2.25  9.75    40.6  19.8
##  3 L71   Lawes  1971 G03    2.48  0.955    2.25 11.4     37.4  20.9
##  4 L71   Lawes  1971 G04    2.92  0.76     1.25 11.1     36.9  20.3
##  5 L71   Lawes  1971 G05    2.14  0.76     1.75 11.9     38.7  20.9
##  6 L71   Lawes  1971 G06    2.86  0.65     1.75 12.2     38.1  20.7
##  7 L71   Lawes  1971 G07    2.97  0.8      1.5  11.2     37.4  21.0
##  8 L71   Lawes  1971 G08    1.65  0.9      2.5  11.8     36.6  19.7
##  9 L71   Lawes  1971 G09    3.1   0.825    2    12.2     37.1  20.3
## 10 L71   Lawes  1971 G10    3.00  0.785    1.75 12.0     37.8  19.5
## # … with 222 more rows

filter(aus.soy, gen %in% c("G01","G02"))

## # A tibble: 16 x 10
##    env   loc         year gen   yield height lodging  size protein   oil
##    <chr> <chr>      <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
##  1 L70   Lawes       1970 G01   2.39   1.44     4.25  8.45    36.7  20.9
##  2 L70   Lawes       1970 G02   2.28   1.45     4.25  9.95    37.6  20.7
##  3 B70   Brookstead  1970 G01   1.25   1.01     3.25  8.85    39.5  18.8
##  4 B70   Brookstead  1970 G02   1.17   1.13     2.75  8.9     38.6  19.8
##  5 N70   Nambour     1970 G01   2.26   0.75     2.25  9.25    34.2  22.3
##  6 N70   Nambour     1970 G02   2.16   0.71     2     9.35    37.6  22.4
##  7 R70   RedlandBay  1970 G01   0.778  0.9      3.25  6.25    40.8  15.9
##  8 R70   RedlandBay  1970 G02   1.09   0.9      3.75  7.35    41.0  17.6
##  9 L71   Lawes       1971 G01   2.79   0.97     2.5   8.35    39.4  18.7
## 10 L71   Lawes       1971 G02   2.62   0.785    2.25  9.75    40.6  19.8
## 11 B71   Brookstead  1971 G01   2.56   0.965    3.25  8.9     40.3  18.2
## 12 B71   Brookstead  1971 G02   2.34   0.835    2.75 10.9     41.8  18.8
## 13 N71   Nambour     1971 G01   2.10   0.725    1.25  7.3     38.5  19.3
## 14 N71   Nambour     1971 G02   2.6    0.595    1     9.45    41.0  20.7
## 15 R71   RedlandBay  1971 G01   1.18   0.97     2.25  6.75    42.1  16.9
## 16 R71   RedlandBay  1971 G02   1.90   0.74     1.75  7.95    42.7  18.9
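It can be tempting to write gen == c("G01", "G02") instead of using %in%. The sketch below, on made-up data (toy is a hypothetical name), suggests why that is risky: == compares position by position and silently recycles the shorter vector, so matching rows can be dropped without any warning.

```r
library(dplyr)

# Made-up genotype column with six rows.
toy <- tibble(gen = c("G02", "G01", "G03", "G02", "G01", "G02"))

# %in% keeps every row whose gen is G01 or G02.
with_in <- filter(toy, gen %in% c("G01", "G02"))

# == recycles c("G01", "G02") to G01, G02, G01, G02, G01, G02 and compares
# row 1 with "G01", row 2 with "G02", and so on.
with_eq <- filter(toy, gen == c("G01", "G02"))

nrow(with_in)  # 5
nrow(with_eq)  # 3
```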

We can apply several conditions at once by adding them one after another, like this:

filter(aus.soy, oil > 20, year != 1970)

## # A tibble: 82 x 10
##    env   loc    year gen   yield height lodging  size protein   oil
##    <chr> <chr> <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
##  1 L71   Lawes  1971 G03    2.48  0.955    2.25  11.4    37.4  20.9
##  2 L71   Lawes  1971 G04    2.92  0.76     1.25  11.1    36.9  20.3
##  3 L71   Lawes  1971 G05    2.14  0.76     1.75  11.9    38.7  20.9
##  4 L71   Lawes  1971 G06    2.86  0.65     1.75  12.2    38.1  20.7
##  5 L71   Lawes  1971 G07    2.97  0.8      1.5   11.2    37.4  21.0
##  6 L71   Lawes  1971 G09    3.1   0.825    2     12.2    37.1  20.3
##  7 L71   Lawes  1971 G24    2.91  0.84     1.75  10.2    40.6  20.2
##  8 L71   Lawes  1971 G44    2.77  0.725    2     14      36.9  22.2
##  9 L71   Lawes  1971 G45    3.82  0.56     1.25  19.6    36.3  23.7
## 10 L71   Lawes  1971 G46    3.00  0.52     1.25  15.8    36.8  23.2
## # … with 72 more rows

The code above keeps only the rows with oil > 20 that are not from 1970. We can put & in place of the , ; the meaning is the same.

filter(aus.soy, oil > 20 & year != 1970)

## # A tibble: 82 x 10
##    env   loc    year gen   yield height lodging  size protein   oil
##    <chr> <chr> <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
##  1 L71   Lawes  1971 G03    2.48  0.955    2.25  11.4    37.4  20.9
##  2 L71   Lawes  1971 G04    2.92  0.76     1.25  11.1    36.9  20.3
##  3 L71   Lawes  1971 G05    2.14  0.76     1.75  11.9    38.7  20.9
##  4 L71   Lawes  1971 G06    2.86  0.65     1.75  12.2    38.1  20.7
##  5 L71   Lawes  1971 G07    2.97  0.8      1.5   11.2    37.4  21.0
##  6 L71   Lawes  1971 G09    3.1   0.825    2     12.2    37.1  20.3
##  7 L71   Lawes  1971 G24    2.91  0.84     1.75  10.2    40.6  20.2
##  8 L71   Lawes  1971 G44    2.77  0.725    2     14      36.9  22.2
##  9 L71   Lawes  1971 G45    3.82  0.56     1.25  19.6    36.3  23.7
## 10 L71   Lawes  1971 G46    3.00  0.52     1.25  15.8    36.8  23.2
## # … with 72 more rows

For "or", use the | operator.

filter(aus.soy, oil > 20 | year != 1970)

## # A tibble: 367 x 10
##    env   loc    year gen   yield height lodging  size protein   oil
##    <chr> <chr> <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
##  1 L70   Lawes  1970 G01    2.39  1.44     4.25  8.45    36.7  20.9
##  2 L70   Lawes  1970 G02    2.28  1.45     4.25  9.95    37.6  20.7
##  3 L70   Lawes  1970 G03    2.57  1.46     3.75 10.8     37.8  21.3
##  4 L70   Lawes  1970 G04    2.88  1.26     3.5  10.0     38.4  22.0
##  5 L70   Lawes  1970 G05    2.39  1.34     3.5  11       37.5  22.1
##  6 L70   Lawes  1970 G06    2.41  1.36     4    11.8     38.2  21.2
##  7 L70   Lawes  1970 G07    2.70  1.3      3    11.8     37.4  21.7
##  8 L70   Lawes  1970 G08    2.46  0.955    3.25 10       35.2  21.1
##  9 L70   Lawes  1970 G09    2.57  1.03     3    11.2     35.9  21.5
## 10 L70   Lawes  1970 G10    2.98  1.16     3.75 10.8     39.7  20.4
## # … with 357 more rows

Note that the resulting table is larger (367x10), because the condition uses "or" instead of "and".
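The row counts are linked: rows matching "A or B" equal rows matching A, plus rows matching B, minus rows matching both. With the outputs above that is 217 + 232 - 82 = 367. The sketch below checks the same identity on a made-up four-row table (toy is a hypothetical name):

```r
library(dplyr)

# Made-up data: two years, four oil values.
toy <- tibble(year = c(1970, 1970, 1971, 1971), oil = c(21, 19, 22, 18))

both_comma <- filter(toy, oil > 20, year != 1970)  # "and" written with ,
both_amp <- filter(toy, oil > 20 & year != 1970)   # "and" written with &
either <- filter(toy, oil > 20 | year != 1970)     # "or"

n_oil <- nrow(filter(toy, oil > 20))       # 2
n_year <- nrow(filter(toy, year != 1970))  # 2

# 3 rows match "or": 2 + 2 - 1
nrow(either) == n_oil + n_year - nrow(both_amp)
```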

Exercise

Use filter to answer the following questions:

How many rows come from genotype G10 and have protein greater than 40.0?
How many rows come from 1970, have yield greater than or equal to 2.75, and have height less than 1?
How many rows come from the location "Lawes" or have protein greater than 40?

4.2 `arrange()`

arrange is similar to sorting in Excel. We can order rows by the values in a column like this:

arrange(aus.soy, loc)

## # A tibble: 464 x 10
##    env   loc         year gen   yield height lodging  size protein   oil
##    <chr> <chr>      <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
##  1 B70   Brookstead  1970 G01   1.25    1.01    3.25  8.85    39.5  18.8
##  2 B70   Brookstead  1970 G02   1.17    1.13    2.75  8.9     38.6  19.8
##  3 B70   Brookstead  1970 G03   0.468   1.16    2.25 10.8     37.8  20.4
##  4 B70   Brookstead  1970 G04   1.44    1.24    1.5  10.6     38.7  20.4
##  5 B70   Brookstead  1970 G05   1.34    1.12    2    12.0     37.8  20.8
##  6 B70   Brookstead  1970 G06   0.913   1.10    2.25 11       37.4  19.9
##  7 B70   Brookstead  1970 G07   1.24    1.13    2    10.2     37.8  20.3
##  8 B70   Brookstead  1970 G08   0.385   1.12    2.25  6.15    38.5  17.9
##  9 B70   Brookstead  1970 G09   1.11    1.04    1.75  8.3     37.9  20.0
## 10 B70   Brookstead  1970 G10   1.80    1.04    2    11.8     38.4  19.7
## # … with 454 more rows

The command above orders the rows alphabetically by loc (A->Z). To reverse the order, wrap the variable name in desc().

arrange(aus.soy, desc(loc))

## # A tibble: 464 x 10
##    env   loc         year gen   yield height lodging  size protein   oil
##    <chr> <chr>      <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
##  1 R70   RedlandBay  1970 G01   0.778  0.9      3.25  6.25    40.8  15.9
##  2 R70   RedlandBay  1970 G02   1.09   0.9      3.75  7.35    41.0  17.6
##  3 R70   RedlandBay  1970 G03   1.58   0.835    2.25  8.5     39.8  20.0
##  4 R70   RedlandBay  1970 G04   1.99   1.00     2.5   8.85    40.0  21.1
##  5 R70   RedlandBay  1970 G05   2.04   0.85     2.25  9.65    37.4  21.4
##  6 R70   RedlandBay  1970 G06   1.35   0.93     3     9.3     38    19.9
##  7 R70   RedlandBay  1970 G07   1.81   0.9      2.25  7.85    37.7  20.7
##  8 R70   RedlandBay  1970 G08   1.16   0.89     1.75  7.9     39.5  19.1
##  9 R70   RedlandBay  1970 G09   1.93   0.93     2.25  8.65    38.2  19.8
## 10 R70   RedlandBay  1970 G10   1.52   0.85     2.75 10.3     39.4  20.0
## # … with 454 more rows

We can sort by several columns at once by adding more names; rows are sorted by the columns in the order given.

arrange(aus.soy, loc, gen, yield)

## # A tibble: 464 x 10
##    env   loc         year gen   yield height lodging  size protein   oil
##    <chr> <chr>      <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
##  1 B70   Brookstead  1970 G01   1.25   1.01     3.25  8.85    39.5  18.8
##  2 B71   Brookstead  1971 G01   2.56   0.965    3.25  8.9     40.3  18.2
##  3 B70   Brookstead  1970 G02   1.17   1.13     2.75  8.9     38.6  19.8
##  4 B71   Brookstead  1971 G02   2.34   0.835    2.75 10.9     41.8  18.8
##  5 B70   Brookstead  1970 G03   0.468  1.16     2.25 10.8     37.8  20.4
##  6 B71   Brookstead  1971 G03   2.87   0.98     3.25 12.4     39.8  19.6
##  7 B70   Brookstead  1970 G04   1.44   1.24     1.5  10.6     38.7  20.4
##  8 B71   Brookstead  1971 G04   3.01   0.9      2.25 12.8     38.5  19.5
##  9 B70   Brookstead  1970 G05   1.34   1.12     2    12.0     37.8  20.8
## 10 B71   Brookstead  1971 G05   2.54   0.975    3.25 13.3     39.4  19.9
## # … with 454 more rows

arrange(aus.soy, loc, yield, gen)

## # A tibble: 464 x 10
##    env   loc         year gen   yield height lodging  size protein   oil
##    <chr> <chr>      <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
##  1 B70   Brookstead  1970 G08   0.385   1.12    2.25  6.15    38.5  17.9
##  2 B70   Brookstead  1970 G30   0.424   1.07    2.5   5.75    40.2  16.5
##  3 B70   Brookstead  1970 G52   0.466   0.56    1.25 10.2     40.4  17.4
##  4 B70   Brookstead  1970 G03   0.468   1.16    2.25 10.8     37.8  20.4
##  5 B70   Brookstead  1970 G13   0.577   1.25    3     6.7     41.8  17.2
##  6 B70   Brookstead  1970 G19   0.595   1.20    2.75  6.65    43    17.4
##  7 B70   Brookstead  1970 G58   0.617   0.66    1    16.5     42.1  19.4
##  8 B70   Brookstead  1970 G29   0.779   1.20    2.5   7.8     40.4  18.0
##  9 B70   Brookstead  1970 G12   0.793   1.10    3     5.4     43.6  16.8
## 10 B70   Brookstead  1970 G36   0.814   1.10    3.25  6.2     41.0  16.6
## # … with 454 more rows

Does the order of the variable names matter?
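A four-row made-up table makes the effect easy to check by eye. In this sketch (toy is a hypothetical name), the first key groups the rows and each later key only breaks ties within the earlier ones; desc() can be applied to any single key.

```r
library(dplyr)

# Made-up data: two locations, two genotypes.
toy <- tibble(
  loc = c("Lawes", "Brookstead", "Lawes", "Brookstead"),
  gen = c("G02", "G02", "G01", "G01"),
  yield = c(2.3, 1.2, 2.4, 1.3)
)

# Sort by loc, then by gen within each loc.
by_loc_gen <- arrange(toy, loc, gen)
by_loc_gen$gen    # "G01" "G02" "G01" "G02"

# Sort by loc, then by yield within each loc: the gen order changes.
by_loc_yield <- arrange(toy, loc, yield)
by_loc_yield$gen  # "G02" "G01" "G02" "G01"

# Ascending loc, but highest yield first within each loc.
by_loc_desc <- arrange(toy, loc, desc(yield))
by_loc_desc$yield # 1.3 1.2 2.4 2.3
```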

Exercise

Use arrange to answer the following questions:

Which location has the highest yield, and what is that yield?
Which genotype (gen) has the lowest protein, and which year is that value from?

4.3 `select()`

We use select() to keep only the columns we want.

select(aus.soy, env, loc, gen)

## # A tibble: 464 x 3
##    env   loc   gen  
##    <chr> <chr> <chr>
##  1 L70   Lawes G01  
##  2 L70   Lawes G02  
##  3 L70   Lawes G03  
##  4 L70   Lawes G04  
##  5 L70   Lawes G05  
##  6 L70   Lawes G06  
##  7 L70   Lawes G07  
##  8 L70   Lawes G08  
##  9 L70   Lawes G09  
## 10 L70   Lawes G10  
## # … with 454 more rows

Or drop the columns we do not want by putting a - in front of their names.

select(aus.soy, -env, -loc)

## # A tibble: 464 x 8
##     year gen   yield height lodging  size protein   oil
##    <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
##  1  1970 G01    2.39  1.44     4.25  8.45    36.7  20.9
##  2  1970 G02    2.28  1.45     4.25  9.95    37.6  20.7
##  3  1970 G03    2.57  1.46     3.75 10.8     37.8  21.3
##  4  1970 G04    2.88  1.26     3.5  10.0     38.4  22.0
##  5  1970 G05    2.39  1.34     3.5  11       37.5  22.1
##  6  1970 G06    2.41  1.36     4    11.8     38.2  21.2
##  7  1970 G07    2.70  1.3      3    11.8     37.4  21.7
##  8  1970 G08    2.46  0.955    3.25 10       35.2  21.1
##  9  1970 G09    2.57  1.03     3    11.2     35.9  21.5
## 10  1970 G10    2.98  1.16     3.75 10.8     39.7  20.4
## # … with 454 more rows

Or use it to reorder columns.

select(aus.soy, loc, env, gen)

## # A tibble: 464 x 3
##    loc   env   gen  
##    <chr> <chr> <chr>
##  1 Lawes L70   G01  
##  2 Lawes L70   G02  
##  3 Lawes L70   G03  
##  4 Lawes L70   G04  
##  5 Lawes L70   G05  
##  6 Lawes L70   G06  
##  7 Lawes L70   G07  
##  8 Lawes L70   G08  
##  9 Lawes L70   G09  
## 10 Lawes L70   G10  
## # … with 454 more rows

Or move some columns to the front and keep the rest in their original order.

select(aus.soy, loc, year, env, everything())

## # A tibble: 464 x 10
##    loc    year env   gen   yield height lodging  size protein   oil
##    <chr> <dbl> <chr> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
##  1 Lawes  1970 L70   G01    2.39  1.44     4.25  8.45    36.7  20.9
##  2 Lawes  1970 L70   G02    2.28  1.45     4.25  9.95    37.6  20.7
##  3 Lawes  1970 L70   G03    2.57  1.46     3.75 10.8     37.8  21.3
##  4 Lawes  1970 L70   G04    2.88  1.26     3.5  10.0     38.4  22.0
##  5 Lawes  1970 L70   G05    2.39  1.34     3.5  11       37.5  22.1
##  6 Lawes  1970 L70   G06    2.41  1.36     4    11.8     38.2  21.2
##  7 Lawes  1970 L70   G07    2.70  1.3      3    11.8     37.4  21.7
##  8 Lawes  1970 L70   G08    2.46  0.955    3.25 10       35.2  21.1
##  9 Lawes  1970 L70   G09    2.57  1.03     3    11.2     35.9  21.5
## 10 Lawes  1970 L70   G10    2.98  1.16     3.75 10.8     39.7  20.4
## # … with 454 more rows

Pro tip: we can also select columns by their properties with select_if(), for example to keep only the numeric columns.

select_if(aus.soy, is.numeric)

## # A tibble: 464 x 7
##     year yield height lodging  size protein   oil
##    <dbl> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
##  1  1970  2.39  1.44     4.25  8.45    36.7  20.9
##  2  1970  2.28  1.45     4.25  9.95    37.6  20.7
##  3  1970  2.57  1.46     3.75 10.8     37.8  21.3
##  4  1970  2.88  1.26     3.5  10.0     38.4  22.0
##  5  1970  2.39  1.34     3.5  11       37.5  22.1
##  6  1970  2.41  1.36     4    11.8     38.2  21.2
##  7  1970  2.70  1.3      3    11.8     37.4  21.7
##  8  1970  2.46  0.955    3.25 10       35.2  21.1
##  9  1970  2.57  1.03     3    11.2     35.9  21.5
## 10  1970  2.98  1.16     3.75 10.8     39.7  20.4
## # … with 454 more rows
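The output at the top of this document comes from dplyr 0.8.3. In later releases the same selection can be written inside select() with where(); the sketch below assumes dplyr 1.0.0 or newer and uses a made-up table (toy is a hypothetical name).

```r
library(dplyr)

# Made-up table with one text column and two numeric columns.
toy <- tibble(
  env = c("L70", "B70"),
  year = c(1970, 1970),
  yield = c(2.39, 1.25)
)

# where() takes a function that returns TRUE or FALSE for each column.
numeric_cols <- select(toy, where(is.numeric))
text_cols <- select(toy, where(is.character))

names(numeric_cols)
names(text_cols)
```

On an installation as old as the one shown above, where() would not be available and select_if() remains the way to do this.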

Or select the columns whose names start with "y" or contain the letter "g", which is very useful when there are many columns.

select(aus.soy, starts_with("y"))

## # A tibble: 464 x 2
##     year yield
##    <dbl> <dbl>
##  1  1970  2.39
##  2  1970  2.28
##  3  1970  2.57
##  4  1970  2.88
##  5  1970  2.39
##  6  1970  2.41
##  7  1970  2.70
##  8  1970  2.46
##  9  1970  2.57
## 10  1970  2.98
## # … with 454 more rows

select(aus.soy, contains("g"))

## # A tibble: 464 x 3
##    gen   height lodging
##    <chr>  <dbl>   <dbl>
##  1 G01    1.44     4.25
##  2 G02    1.45     4.25
##  3 G03    1.46     3.75
##  4 G04    1.26     3.5 
##  5 G05    1.34     3.5 
##  6 G06    1.36     4   
##  7 G07    1.3      3   
##  8 G08    0.955    3.25
##  9 G09    1.03     3   
## 10 G10    1.16     3.75
## # … with 454 more rows
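Two more selection forms can shorten a select() call: ends_with() matches the end of a name, and first:last picks a run of neighbouring columns. A minimal sketch on a made-up one-row table (toy is a hypothetical name):

```r
library(dplyr)

# Made-up one-row table with six columns.
toy <- tibble(
  env = "L70", loc = "Lawes", year = 1970,
  gen = "G01", yield = 2.39, height = 1.44
)

# Every column from loc through gen, in their current order.
run <- select(toy, loc:gen)

# Columns whose names end with "d".
ends_d <- select(toy, ends_with("d"))

# A run of columns can be dropped with - in front of the parentheses.
dropped <- select(toy, -(env:loc))

names(run)
names(ends_d)
names(dropped)
```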

Exercise

Write the shortest code you can that returns a table with 8 columns, with the variables in this order: loc, year, gen, yield, lodging, size, protein, oil

4.4 `mutate()`

The commands so far have not changed the data themselves. mutate creates new columns, either from the data in existing columns or from new data.

For example, height is in metres. We can add a new column with height in centimetres by multiplying the original column by 100.

mutate(aus.soy, height.cm = height*100)

## # A tibble: 464 x 11
##    env   loc    year gen   yield height lodging  size protein   oil
##    <chr> <chr> <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
##  1 L70   Lawes  1970 G01    2.39  1.44     4.25  8.45    36.7  20.9
##  2 L70   Lawes  1970 G02    2.28  1.45     4.25  9.95    37.6  20.7
##  3 L70   Lawes  1970 G03    2.57  1.46     3.75 10.8     37.8  21.3
##  4 L70   Lawes  1970 G04    2.88  1.26     3.5  10.0     38.4  22.0
##  5 L70   Lawes  1970 G05    2.39  1.34     3.5  11       37.5  22.1
##  6 L70   Lawes  1970 G06    2.41  1.36     4    11.8     38.2  21.2
##  7 L70   Lawes  1970 G07    2.70  1.3      3    11.8     37.4  21.7
##  8 L70   Lawes  1970 G08    2.46  0.955    3.25 10       35.2  21.1
##  9 L70   Lawes  1970 G09    2.57  1.03     3    11.2     35.9  21.5
## 10 L70   Lawes  1970 G10    2.98  1.16     3.75 10.8     39.7  20.4
## # … with 454 more rows, and 1 more variable: height.cm <dbl>

We can also add several columns at once by separating them with a comma.

Here we first select fewer columns (so the result is easier to see), then create new columns for the protein:oil ratio and height.cm at the same time.

aus.soy2 <- select(aus.soy, height, protein, oil)
mutate(aus.soy2, height.cm = height*100, ratio = protein/oil)

## # A tibble: 464 x 5
##    height protein   oil height.cm ratio
##     <dbl>   <dbl> <dbl>     <dbl> <dbl>
##  1  1.44     36.7  20.9     144.   1.76
##  2  1.45     37.6  20.7     145    1.81
##  3  1.46     37.8  21.3     146    1.78
##  4  1.26     38.4  22.0     126    1.75
##  5  1.34     37.5  22.1     134.   1.69
##  6  1.36     38.2  21.2     136    1.81
##  7  1.3      37.4  21.7     130    1.72
##  8  0.955    35.2  21.1      95.5  1.66
##  9  1.03     35.9  21.5     103    1.67
## 10  1.16     39.7  20.4     116.   1.94
## # … with 454 more rows
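
Columns created inside one mutate() call are available to the expressions that follow them in the same call. The sketch below shows this on a made-up table; `toy` and the extra column height.mm are assumptions for illustration, not part of the original data.

```r
library(dplyr)

# Hypothetical toy data with three of the aus.soy columns
toy <- tibble(
  height  = c(1.44, 1.26, 0.955),
  protein = c(36.7, 38.4, 35.2),
  oil     = c(20.9, 22.0, 21.1)
)

toy2 <- mutate(
  toy,
  height.cm = height * 100,
  height.mm = height.cm * 10,  # uses the column created one line above
  ratio     = protein / oil
)

toy2
```

The original `toy` table is left unchanged; mutate() returns a new table, which is why the result is assigned to `toy2`.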

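We can also use other functions when creating a new column. Here rank() ranks the first ten rows by oil content, and arrange() then sorts the rows by oil to confirm that the ranks are in order.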

aus.soy3 <- aus.soy2[1:10,]
aus.soy4 <- mutate(aus.soy3, rank.oil = rank(oil))
aus.soy4

## # A tibble: 10 x 4
##    height protein   oil rank.oil
##     <dbl>   <dbl> <dbl>    <dbl>
##  1  1.44     36.7  20.9        3
##  2  1.45     37.6  20.7        2
##  3  1.46     37.8  21.3        6
##  4  1.26     38.4  22.0        9
##  5  1.34     37.5  22.1       10
##  6  1.36     38.2  21.2        5
##  7  1.3      37.4  21.7        8
##  8  0.955    35.2  21.1        4
##  9  1.03     35.9  21.5        7
## 10  1.16     39.7  20.4        1

arrange(aus.soy4, oil)

## # A tibble: 10 x 4
##    height protein   oil rank.oil
##     <dbl>   <dbl> <dbl>    <dbl>
##  1  1.16     39.7  20.4        1
##  2  1.45     37.6  20.7        2
##  3  1.44     36.7  20.9        3
##  4  0.955    35.2  21.1        4
##  5  1.36     38.2  21.2        5
##  6  1.46     37.8  21.3        6
##  7  1.03     35.9  21.5        7
##  8  1.3      37.4  21.7        8
##  9  1.26     38.4  22.0        9
## 10  1.34     37.5  22.1       10
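
rank() gives rank 1 to the smallest value. If you would rather have rank 1 for the highest oil content, one option is dplyr's min_rank() combined with desc(). A minimal sketch on invented data (`toy` and the column names rank.low and rank.high are hypothetical):

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  gen = c("G01", "G02", "G03", "G04"),
  oil = c(20.9, 20.7, 21.3, 22.0)
)

ranked <- mutate(
  toy,
  rank.low  = rank(oil),            # 1 = lowest oil
  rank.high = min_rank(desc(oil))   # 1 = highest oil
)

# Sort so that the genotype with the most oil comes first
ranked <- arrange(ranked, rank.high)
ranked
```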

Exercise

Create a new column named protein.prop that converts protein from a percentage to a proportion by dividing the original protein column by 100.

4.5 `summarize()`

summarize() collapses a data table into a new, smaller table of summary values. For example, you can find the mean height across the whole data table:

summarize(aus.soy, mean.height = mean(height))

## # A tibble: 1 x 1
##   mean.height
##         <dbl>
## 1       0.883

Going further, you can summarize every numeric column at once with summarize_if():

summarize_if(aus.soy, is.numeric, mean)

## # A tibble: 1 x 7
##    year yield height lodging  size protein   oil
##   <dbl> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
## 1 1970.  2.05  0.883    2.31  11.1    40.3  19.9
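
In dplyr 1.0.0 and later, summarize_if() still works but is superseded by across(). The sketch below assumes such a version is installed (the session shown at the top of this document uses dplyr 0.8.3, where across() is not available) and uses a made-up `toy` table.

```r
library(dplyr)

# Hypothetical toy data: one character column, two numeric columns
toy <- tibble(
  loc   = c("Lawes", "Lawes", "Nambour"),
  yield = c(2.0, 3.0, 4.0),
  oil   = c(20, 21, 22)
)

# Take the mean of every numeric column; loc is skipped because it is character
means <- summarize(toy, across(where(is.numeric), mean))
means
```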

summarize() by itself is not very useful, because you usually want a summary for each group, such as each year or genotype. The function group_by() helps with this. Let’s group by year.

aus.soy.by.year <- group_by(aus.soy, year)
aus.soy.by.year

## # A tibble: 464 x 10
## # Groups:   year [2]
##    env   loc    year gen   yield height lodging  size protein   oil
##    <chr> <chr> <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
##  1 L70   Lawes  1970 G01    2.39  1.44     4.25  8.45    36.7  20.9
##  2 L70   Lawes  1970 G02    2.28  1.45     4.25  9.95    37.6  20.7
##  3 L70   Lawes  1970 G03    2.57  1.46     3.75 10.8     37.8  21.3
##  4 L70   Lawes  1970 G04    2.88  1.26     3.5  10.0     38.4  22.0
##  5 L70   Lawes  1970 G05    2.39  1.34     3.5  11       37.5  22.1
##  6 L70   Lawes  1970 G06    2.41  1.36     4    11.8     38.2  21.2
##  7 L70   Lawes  1970 G07    2.70  1.3      3    11.8     37.4  21.7
##  8 L70   Lawes  1970 G08    2.46  0.955    3.25 10       35.2  21.1
##  9 L70   Lawes  1970 G09    2.57  1.03     3    11.2     35.9  21.5
## 10 L70   Lawes  1970 G10    2.98  1.16     3.75 10.8     39.7  20.4
## # … with 454 more rows

The data table is the same, but the header now reads

# A tibble: 464 x 10

# Groups: year [2]

which means that the table is now grouped by year, with two groups. We can now summarize by year.

summarize(aus.soy.by.year, mean.oil = mean(oil), sd.oil = sd(oil))

## # A tibble: 2 x 3
##    year mean.oil sd.oil
##   <dbl>    <dbl>  <dbl>
## 1  1970     20.5   2.84
## 2  1971     19.3   2.37
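
A grouped summary can hold several statistics at once, including the number of rows per group from n(). A small sketch with invented numbers (the `toy` table is hypothetical):

```r
library(dplyr)

# Hypothetical toy data: two years with a few oil values each
toy <- tibble(
  year = c(1970, 1970, 1971, 1971, 1971),
  oil  = c(20, 22, 18, 19, 20)
)

toy.by.year <- group_by(toy, year)

res <- summarize(
  toy.by.year,
  n        = n(),        # number of rows in each group
  mean.oil = mean(oil),
  sd.oil   = sd(oil)
)
res
```

Reporting n next to the mean and sd makes it easy to spot groups that have only a few observations.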

5. Pipe Operation

Combining group_by() and summarize() is very powerful, but it is annoying to assign a new object at every step. To solve this, we can use the pipe operator to chain commands together without retyping everything. For example, let’s do the same thing again here:

aus.soy %>% 
  group_by(year) %>% 
  summarize(mean.oil = mean(oil), sd.oil = sd(oil))

## # A tibble: 2 x 3
##    year mean.oil sd.oil
##   <dbl>    <dbl>  <dbl>
## 1  1970     20.5   2.84
## 2  1971     19.3   2.37

%>% (keyboard shortcut in RStudio: Ctrl+Shift+M) is called the “pipe”. It passes the object on its left to the function on its right as the first argument, and passes that function’s result on to the next line. This way, you can run the next function without retyping the object name.
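
To see that the pipe only changes how the code is written, the sketch below computes the same summary twice on a made-up `toy` table (hypothetical data), once with nested calls and once with %>%.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  year = c(1970, 1970, 1971),
  oil  = c(20, 22, 19)
)

# Nested calls: read from the inside out
nested <- summarize(group_by(toy, year), mean.oil = mean(oil))

# Piped calls: read from top to bottom
piped <- toy %>%
  group_by(year) %>%
  summarize(mean.oil = mean(oil))

all.equal(nested, piped)
```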

Here is an example of how the pipe helps you work with data quickly. I want to calculate the mean and sd of oil at each location, but only for the year 1970.

aus.soy %>% 
  filter(year == 1970) %>% 
  group_by(loc) %>% 
  summarize(mean.oil = mean(oil), sd.oil = sd(oil))

## # A tibble: 4 x 3
##   loc        mean.oil sd.oil
##   <chr>         <dbl>  <dbl>
## 1 Brookstead     19.5   2.16
## 2 Lawes          20.7   1.73
## 3 Nambour        22.7   1.65
## 4 RedlandBay     19.0   3.76

You can also group at two levels. For example, group the data by loc and then year.

aus.soy %>% 
  group_by(loc, year) %>% 
  summarize(mean.oil = mean(oil), sd.oil = sd(oil))

## # A tibble: 8 x 4
## # Groups:   loc [4]
##   loc         year mean.oil sd.oil
##   <chr>      <dbl>    <dbl>  <dbl>
## 1 Brookstead  1970     19.5   2.16
## 2 Brookstead  1971     18.9   2.26
## 3 Lawes       1970     20.7   1.73
## 4 Lawes       1971     19.7   2.42
## 5 Nambour     1970     22.7   1.65
## 6 Nambour     1971     20.0   2.09
## 7 RedlandBay  1970     19.0   3.76
## 8 RedlandBay  1971     18.8   2.49
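
Notice the line `# Groups: loc [4]` in the output above: after summarizing over two grouping variables, the result is still grouped by the first one. If that is not wanted, ungroup() removes the remaining grouping. A sketch on invented data (`toy` is hypothetical; newer dplyr versions may also print a message about the remaining grouping):

```r
library(dplyr)

# Hypothetical toy data: two locations, two years
toy <- tibble(
  loc  = c("Lawes", "Lawes", "Nambour", "Nambour"),
  year = c(1970, 1971, 1970, 1971),
  oil  = c(20, 19, 22, 21)
)

still.grouped <- toy %>%
  group_by(loc, year) %>%
  summarize(mean.oil = mean(oil))

flat <- still.grouped %>%
  ungroup()

group_vars(still.grouped)  # grouping variables that remain after summarize()
group_vars(flat)           # none left after ungroup()
```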

Exercise

In the code chunk, create a summary of the mean and standard deviation of protein content by year and loc. Then filter it to include only “Brookstead” and “Lawes”. Use the pipe in your code.

6. [Advanced] Additional verbs for manipulation

6.1 `separate()` and `unite()`

There will be times when you want to combine columns or separate them. This is easily done with the functions separate() and unite().

First, unite() combines two columns into a new column. For example, you may want to combine loc and year so that you can see them together at a glance.

aus.soy.u <- aus.soy %>% 
  select(loc, year, gen) %>% 
  unite("loc_year", c("loc", "year"))

aus.soy.u

## # A tibble: 464 x 2
##    loc_year   gen  
##    <chr>      <chr>
##  1 Lawes_1970 G01  
##  2 Lawes_1970 G02  
##  3 Lawes_1970 G03  
##  4 Lawes_1970 G04  
##  5 Lawes_1970 G05  
##  6 Lawes_1970 G06  
##  7 Lawes_1970 G07  
##  8 Lawes_1970 G08  
##  9 Lawes_1970 G09  
## 10 Lawes_1970 G10  
## # … with 454 more rows

Note that the original loc and year columns are now gone. If you want to keep them, you can use mutate() with paste() instead.

aus.soy %>% 
  select(loc, year, gen) %>% 
  mutate(loc_year = paste(loc, year,sep = "_"))

## # A tibble: 464 x 4
##    loc    year gen   loc_year  
##    <chr> <dbl> <chr> <chr>     
##  1 Lawes  1970 G01   Lawes_1970
##  2 Lawes  1970 G02   Lawes_1970
##  3 Lawes  1970 G03   Lawes_1970
##  4 Lawes  1970 G04   Lawes_1970
##  5 Lawes  1970 G05   Lawes_1970
##  6 Lawes  1970 G06   Lawes_1970
##  7 Lawes  1970 G07   Lawes_1970
##  8 Lawes  1970 G08   Lawes_1970
##  9 Lawes  1970 G09   Lawes_1970
## 10 Lawes  1970 G10   Lawes_1970
## # … with 454 more rows
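
If rewriting the call with mutate() is inconvenient, unite() also has a remove argument; setting it to FALSE keeps the input columns. A minimal sketch on made-up data (`toy` is hypothetical):

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data
toy <- tibble(
  loc  = c("Lawes", "Nambour"),
  year = c(1970, 1971),
  gen  = c("G01", "G02")
)

# remove = FALSE keeps loc and year next to the new loc_year column
kept <- toy %>%
  unite("loc_year", loc, year, sep = "_", remove = FALSE)

kept
```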

You can do the opposite with separate(). Note that the recovered year column is now character (<chr>) rather than numeric.

aus.soy.u %>% 
  separate(loc_year, c("loc", "year"))

## # A tibble: 464 x 3
##    loc   year  gen  
##    <chr> <chr> <chr>
##  1 Lawes 1970  G01  
##  2 Lawes 1970  G02  
##  3 Lawes 1970  G03  
##  4 Lawes 1970  G04  
##  5 Lawes 1970  G05  
##  6 Lawes 1970  G06  
##  7 Lawes 1970  G07  
##  8 Lawes 1970  G08  
##  9 Lawes 1970  G09  
## 10 Lawes 1970  G10  
## # … with 454 more rows
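
When the pieces should come back as numbers, separate() has a convert argument that tries to guess a better type for each new column. The sketch below uses a made-up table (`toy.u` is hypothetical) and also states the separator explicitly with sep.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with a combined column
toy.u <- tibble(
  loc_year = c("Lawes_1970", "Nambour_1971"),
  gen      = c("G01", "G02")
)

# convert = TRUE turns the year pieces back into numbers
toy.s <- toy.u %>%
  separate(loc_year, into = c("loc", "year"), sep = "_", convert = TRUE)

toy.s
```

Stating sep = "_" explicitly is a cautious choice: it avoids surprises if a location name ever contains another non-alphanumeric character.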

6.2 `pivot_wider()` and `pivot_longer()`

The aus.soy data that we have right now is in the wide format, with each variable in its own column. Sometimes we want the data in the long format, which makes plotting and summarizing easier. For this, we use the function pivot_longer().

Here we will turn all the measurements into a long-form column.

Let’s look at the data in the wide format first.

aus.soy

## # A tibble: 464 x 10
##    env   loc    year gen   yield height lodging  size protein   oil
##    <chr> <chr> <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
##  1 L70   Lawes  1970 G01    2.39  1.44     4.25  8.45    36.7  20.9
##  2 L70   Lawes  1970 G02    2.28  1.45     4.25  9.95    37.6  20.7
##  3 L70   Lawes  1970 G03    2.57  1.46     3.75 10.8     37.8  21.3
##  4 L70   Lawes  1970 G04    2.88  1.26     3.5  10.0     38.4  22.0
##  5 L70   Lawes  1970 G05    2.39  1.34     3.5  11       37.5  22.1
##  6 L70   Lawes  1970 G06    2.41  1.36     4    11.8     38.2  21.2
##  7 L70   Lawes  1970 G07    2.70  1.3      3    11.8     37.4  21.7
##  8 L70   Lawes  1970 G08    2.46  0.955    3.25 10       35.2  21.1
##  9 L70   Lawes  1970 G09    2.57  1.03     3    11.2     35.9  21.5
## 10 L70   Lawes  1970 G10    2.98  1.16     3.75 10.8     39.7  20.4
## # … with 454 more rows

Now we tell the command to create a new column named “measurement” to store the existing variable names and a column named “value” to store the numbers, and to do this for the columns yield through oil.

aus.soy.l <- aus.soy %>% 
  pivot_longer(cols = yield:oil, names_to = "measurement", values_to = "value")

aus.soy.l

## # A tibble: 2,784 x 6
##    env   loc    year gen   measurement value
##    <chr> <chr> <dbl> <chr> <chr>       <dbl>
##  1 L70   Lawes  1970 G01   yield        2.39
##  2 L70   Lawes  1970 G01   height       1.44
##  3 L70   Lawes  1970 G01   lodging      4.25
##  4 L70   Lawes  1970 G01   size         8.45
##  5 L70   Lawes  1970 G01   protein     36.7 
##  6 L70   Lawes  1970 G01   oil         20.9 
##  7 L70   Lawes  1970 G02   yield        2.28
##  8 L70   Lawes  1970 G02   height       1.45
##  9 L70   Lawes  1970 G02   lodging      4.25
## 10 L70   Lawes  1970 G02   size         9.95
## # … with 2,774 more rows
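
The row count follows directly from the reshaping: each of the 464 original rows becomes one row per pivoted column, and 464 x 6 = 2,784. The same idea on a made-up table that is small enough to check by eye (`toy` is hypothetical):

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: 2 rows, 2 measurement columns
toy <- tibble(
  gen   = c("G01", "G02"),
  yield = c(2.39, 2.28),
  oil   = c(20.9, 20.7)
)

# 2 rows x 2 pivoted columns gives 4 rows in the long format
toy.l <- toy %>%
  pivot_longer(cols = yield:oil, names_to = "measurement", values_to = "value")

toy.l
```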

With this we can use group_by() to create a summary (mean, sd) by type of measurement, loc, and year very easily.

aus.soy.summary <- aus.soy.l %>% 
  group_by(measurement, loc,year) %>% 
  summarize(mean = mean(value), sd = sd(value)) 
  
aus.soy.summary

## # A tibble: 48 x 5
## # Groups:   measurement, loc [24]
##    measurement loc         year  mean    sd
##    <chr>       <chr>      <dbl> <dbl> <dbl>
##  1 height      Brookstead  1970 1.02  0.186
##  2 height      Brookstead  1971 0.949 0.179
##  3 height      Lawes       1970 1.22  0.293
##  4 height      Lawes       1971 0.819 0.210
##  5 height      Nambour     1970 0.670 0.176
##  6 height      Nambour     1971 0.633 0.170
##  7 height      RedlandBay  1970 0.897 0.216
##  8 height      RedlandBay  1971 0.862 0.216
##  9 lodging     Brookstead  1970 2.35  0.756
## 10 lodging     Brookstead  1971 2.69  0.699
## # … with 38 more rows

or plot a nice multi-panel figure (we will learn this in detail next week).

ggplot(aus.soy.summary, aes(loc, mean, fill = loc)) +
  geom_bar(stat = "identity") +
  facet_grid(year ~ measurement, scales = "free_x") +
  coord_flip()

Conversely, if you have data in a long-form and find it difficult to look at, you can always convert it back to the wide form with pivot_wider.

aus.soy.l %>% 
  pivot_wider(names_from = "measurement", values_from = "value")

## # A tibble: 464 x 10
##    env   loc    year gen   yield height lodging  size protein   oil
##    <chr> <chr> <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
##  1 L70   Lawes  1970 G01    2.39  1.44     4.25  8.45    36.7  20.9
##  2 L70   Lawes  1970 G02    2.28  1.45     4.25  9.95    37.6  20.7
##  3 L70   Lawes  1970 G03    2.57  1.46     3.75 10.8     37.8  21.3
##  4 L70   Lawes  1970 G04    2.88  1.26     3.5  10.0     38.4  22.0
##  5 L70   Lawes  1970 G05    2.39  1.34     3.5  11       37.5  22.1
##  6 L70   Lawes  1970 G06    2.41  1.36     4    11.8     38.2  21.2
##  7 L70   Lawes  1970 G07    2.70  1.3      3    11.8     37.4  21.7
##  8 L70   Lawes  1970 G08    2.46  0.955    3.25 10       35.2  21.1
##  9 L70   Lawes  1970 G09    2.57  1.03     3    11.2     35.9  21.5
## 10 L70   Lawes  1970 G10    2.98  1.16     3.75 10.8     39.7  20.4
## # … with 454 more rows
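
One thing to watch for when widening: if a combination is missing from the long data, pivot_wider() fills that cell with NA. A minimal sketch on invented data, in which genotype G02 has no oil measurement (`toy.l` is hypothetical):

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format data; there is no oil row for G02
toy.l <- tibble(
  gen         = c("G01", "G01", "G02"),
  measurement = c("yield", "oil", "yield"),
  value       = c(2.39, 20.9, 2.28)
)

toy.w <- toy.l %>%
  pivot_wider(names_from = measurement, values_from = value)

toy.w  # the oil cell for G02 is NA
```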

7. Write out CSV file.

Anything you do here in R can be exported as a spreadsheet (CSV file). Here we will write our last summary to a CSV file. It will appear in your working directory.

write_csv(aus.soy.summary, "aus_soy_summary.csv")
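
To check that a file was written as expected, you can read it straight back in. The sketch below writes a made-up summary table to a temporary folder so that it does not clutter your project; `toy.summary` and the file name toy_summary.csv are assumptions for illustration.

```r
library(readr)
library(tibble)

# Hypothetical summary table
toy.summary <- tibble(
  loc      = c("Lawes", "Nambour"),
  mean.oil = c(20.7, 22.7)
)

# Write to a temporary folder, then read the file back
out.file <- file.path(tempdir(), "toy_summary.csv")
write_csv(toy.summary, out.file)

back <- read_csv(out.file)
back
```

With a bare file name such as "aus_soy_summary.csv", the file goes to the current working directory instead, which you can look up with getwd().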

เอกพันธ์ ไกรจักร์
