Data Manipulation in R: Updated 2022

In this lesson we will learn how to manipulate data (data wrangling; data manipulation). Raw data rarely arrive in the shape we want, so we need ways to get them into a form we can use in later steps.

0. Loading the tidyverse package

Before starting, don't forget to load the package with library(), because many of the functions used below live in this package.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.1.1     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

1. Importing Data

The very first step is to import the data from a .csv file into R, using the function read_csv. The dataset comes from soybean trials grown in Australia in 1970-1971 at 4 locations, and is adapted from https://three-mode.leidenuniv.nl/data/soybeaninf.htm (see the details below).

soy <- read_csv("australian_soybean_small.csv")

## Rows: 24 Columns: 10

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): env, loc, gen
## dbl (7): year, yield, height, lodging, size, protein, oil

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

In the command above, the function read_csv looks for the file named australian_soybean_small.csv and reads it into R. Reading alone is not enough to use the data later, so the result has to be stored in a new object, which I have named soy here. Storing is done with the assignment operator <-.
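To try the read-then-assign step without the CSV file at hand, here is a minimal sketch. It assumes readr 2.0 or later, where literal text wrapped in I() is read as if it were a file. The three-row table and the names csv_text and soy_demo are made up for illustration.

```r
library(readr)

# Hypothetical miniature of the soybean file, supplied as literal text.
# I() tells read_csv() that this is data, not a file path (readr >= 2.0).
csv_text <- I("env,loc,year,yield
L70,Lawes,1970,2.39
B70,Brookstead,1970,1.25
N70,Nambour,1970,2.26")

# Reading alone only prints the table; assigning with <- keeps it for later use.
soy_demo <- read_csv(csv_text, show_col_types = FALSE)

soy_demo
```

The argument show_col_types = FALSE simply silences the column specification message seen above.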

What read_csv read is now stored in the object named soy. Let's see what soy looks like.

2. Preview the data and tibble format

soy

## # A tibble: 24 × 10
##    env   loc         year gen   yield height lodging  size protein   oil
##    <chr> <chr>      <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
##  1 L70   Lawes       1970 G01   2.39   1.44     4.25  8.45    36.7  20.9
##  2 L70   Lawes       1970 G02   2.28   1.45     4.25  9.95    37.6  20.7
##  3 L70   Lawes       1970 G03   2.57   1.46     3.75 10.8     37.8  21.3
##  4 B70   Brookstead  1970 G01   1.25   1.01     3.25  8.85    39.5  18.8
##  5 B70   Brookstead  1970 G02   1.17   1.13     2.75  8.9     38.6  19.8
##  6 B70   Brookstead  1970 G03   0.468  1.16     2.25 10.8     37.8  20.4
##  7 N70   Nambour     1970 G01   2.26   0.75     2.25  9.25    34.2  22.3
##  8 N70   Nambour     1970 G02   2.16   0.71     2     9.35    37.6  22.4
##  9 N70   Nambour     1970 G03   2.35   0.685    1.5  11.9     35.5  23.7
## 10 R70   RedlandBay  1970 G01   0.778  0.9      3.25  6.25    40.8  15.9
## # … with 14 more rows

Printing the object shows several things.

# A tibble: 24 × 10 means a table with 24 rows (observations) and 10 columns. Only the first 10 rows are displayed, just enough to give a first impression of what the data look like.

To display more, use print() and give the number of rows through the argument n. Here we want everything, so we pass n = "all" (n = Inf is the documented equivalent). A specific number also works, e.g. n = 20 shows 20 rows. The object soy always contains all 24 rows; n only controls how many of them are printed.

print(soy, n = "all")

## # A tibble: 24 × 10
##    env   loc         year gen   yield height lodging  size protein   oil
##    <chr> <chr>      <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
##  1 L70   Lawes       1970 G01   2.39   1.44     4.25  8.45    36.7  20.9
##  2 L70   Lawes       1970 G02   2.28   1.45     4.25  9.95    37.6  20.7
##  3 L70   Lawes       1970 G03   2.57   1.46     3.75 10.8     37.8  21.3
##  4 B70   Brookstead  1970 G01   1.25   1.01     3.25  8.85    39.5  18.8
##  5 B70   Brookstead  1970 G02   1.17   1.13     2.75  8.9     38.6  19.8
##  6 B70   Brookstead  1970 G03   0.468  1.16     2.25 10.8     37.8  20.4
##  7 N70   Nambour     1970 G01   2.26   0.75     2.25  9.25    34.2  22.3
##  8 N70   Nambour     1970 G02   2.16   0.71     2     9.35    37.6  22.4
##  9 N70   Nambour     1970 G03   2.35   0.685    1.5  11.9     35.5  23.7
## 10 R70   RedlandBay  1970 G01   0.778  0.9      3.25  6.25    40.8  15.9
## 11 R70   RedlandBay  1970 G02   1.09   0.9      3.75  7.35    41.0  17.6
## 12 R70   RedlandBay  1970 G03   1.58   0.835    2.25  8.5     39.8  20.0
## 13 L71   Lawes       1971 G01   2.79   0.97     2.5   8.35    39.4  18.7
## 14 L71   Lawes       1971 G02   2.62   0.785    2.25  9.75    40.6  19.8
## 15 L71   Lawes       1971 G03   2.48   0.955    2.25 11.4     37.4  20.9
## 16 B71   Brookstead  1971 G01   2.56   0.965    3.25  8.9     40.3  18.2
## 17 B71   Brookstead  1971 G02   2.34   0.835    2.75 10.9     41.8  18.8
## 18 B71   Brookstead  1971 G03   2.87   0.98     3.25 12.4     39.8  19.6
## 19 N71   Nambour     1971 G01   2.10   0.725    1.25  7.3     38.5  19.3
## 20 N71   Nambour     1971 G02   2.6    0.595    1     9.45    41.0  20.7
## 21 N71   Nambour     1971 G03   2.53   0.61     1    11.3     41.6  19.9
## 22 R71   RedlandBay  1971 G01   1.18   0.97     2.25  6.75    42.1  16.9
## 23 R71   RedlandBay  1971 G02   1.90   0.74     1.75  7.95    42.7  18.9
## 24 R71   RedlandBay  1971 G03   2.02   0.835    1.5   9.1     39.8  19.7

The first line after # A tibble: 24 × 10 holds the column names, such as env, loc and year. The line below it, in angle brackets (<chr> or <dbl>), gives the type of each column: env is <chr>, a character (text) variable, while year is <dbl>, a double, i.e. a number. Briefly, the variables are:

env: the growing environment, a code made of the abbreviated location (loc) plus the year (70 or 71)
loc: the location, one of 4 sites: Lawes, Brookstead, Nambour, RedlandBay
year: the year of planting (1970 or 1971)
gen: the genotype grown, one of 3 genotypes (G01, G02, G03)
yield: the yield, in tonnes per hectare
height: plant height, in meters
lodging: mean lodging score of the plants, from 1 (little lodging) to 5 (heavy lodging)
size: seed size, in millimeters
protein: percentage of protein in the seed
oil: percentage of oil in the seed

3. Subsetting data - traditionally

Subsetting data means picking out part of the data to work with. In base R this is done with square brackets [ , ] to choose rows or columns: the number before the comma gives the rows to keep, and the number after it gives the columns.

For example, to pick only row 1 with every column, put 1 before the comma and leave the space after it empty, which means all columns:

soy[1, ]

## # A tibble: 1 × 10
##   env   loc    year gen   yield height lodging  size protein   oil
##   <chr> <chr> <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
## 1 L70   Lawes  1970 G01    2.39   1.44    4.25  8.45    36.7  20.9

Several rows can be selected at once. The : symbol gives a range, e.g. 1:5 means rows 1 to 5. For rows that are not consecutive, combine the numbers with c(), e.g. c(1,3,5) means rows 1, 3 and 5:

soy[1:5, ]

## # A tibble: 5 × 10
##   env   loc         year gen   yield height lodging  size protein   oil
##   <chr> <chr>      <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
## 1 L70   Lawes       1970 G01    2.39   1.44    4.25  8.45    36.7  20.9
## 2 L70   Lawes       1970 G02    2.28   1.45    4.25  9.95    37.6  20.7
## 3 L70   Lawes       1970 G03    2.57   1.46    3.75 10.8     37.8  21.3
## 4 B70   Brookstead  1970 G01    1.25   1.01    3.25  8.85    39.5  18.8
## 5 B70   Brookstead  1970 G02    1.17   1.13    2.75  8.9     38.6  19.8

soy[c(1,3,5), ]

## # A tibble: 3 × 10
##   env   loc         year gen   yield height lodging  size protein   oil
##   <chr> <chr>      <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
## 1 L70   Lawes       1970 G01    2.39   1.44    4.25  8.45    36.7  20.9
## 2 L70   Lawes       1970 G03    2.57   1.46    3.75 10.8     37.8  21.3
## 3 B70   Brookstead  1970 G02    1.17   1.13    2.75  8.9     38.6  19.8

To select a column, either give its number after the comma, or use its name directly after the $ sign. As the output shows, [ , ] returns a tibble while $ returns a plain vector.

soy[,1]

## # A tibble: 24 × 1
##    env  
##    <chr>
##  1 L70  
##  2 L70  
##  3 L70  
##  4 B70  
##  5 B70  
##  6 B70  
##  7 N70  
##  8 N70  
##  9 N70  
## 10 R70  
## # … with 14 more rows

soy$env

##  [1] "L70" "L70" "L70" "B70" "B70" "B70" "N70" "N70" "N70" "R70" "R70" "R70"
## [13] "L71" "L71" "L71" "B71" "B71" "B71" "N71" "N71" "N71" "R71" "R71" "R71"

Several columns can be selected at once, again with c(), either by position or by name:

soy[,c(1,3,5)]

## # A tibble: 24 × 3
##    env    year yield
##    <chr> <dbl> <dbl>
##  1 L70    1970 2.39 
##  2 L70    1970 2.28 
##  3 L70    1970 2.57 
##  4 B70    1970 1.25 
##  5 B70    1970 1.17 
##  6 B70    1970 0.468
##  7 N70    1970 2.26 
##  8 N70    1970 2.16 
##  9 N70    1970 2.35 
## 10 R70    1970 0.778
## # … with 14 more rows

soy[,c("loc","year")]

## # A tibble: 24 × 2
##    loc         year
##    <chr>      <dbl>
##  1 Lawes       1970
##  2 Lawes       1970
##  3 Lawes       1970
##  4 Brookstead  1970
##  5 Brookstead  1970
##  6 Brookstead  1970
##  7 Nambour     1970
##  8 Nambour     1970
##  9 Nambour     1970
## 10 RedlandBay  1970
## # … with 14 more rows

With [ , ] we can select rows and columns at the same time.

soy[1:5,1:3]

## # A tibble: 5 × 3
##   env   loc         year
##   <chr> <chr>      <dbl>
## 1 L70   Lawes       1970
## 2 L70   Lawes       1970
## 3 L70   Lawes       1970
## 4 B70   Brookstead  1970
## 5 B70   Brookstead  1970

The command above selects rows 1 to 5 and columns 1 to 3 together, giving a 5 × 3 table.

Notice that everything so far only subsets the data and prints the result. To use the subset later, assign it (<-) to a new object, as in the example below.

soy_1 <- soy[1:5, 5:6]
soy_1

## # A tibble: 5 × 2
##   yield height
##   <dbl>  <dbl>
## 1  2.39   1.44
## 2  2.28   1.45
## 3  2.57   1.46
## 4  1.25   1.01
## 5  1.17   1.13

Or, if we don't feel like naming yet another object and just want a quick calculation, a subsetting expression can be placed directly inside another function. For example, to get the mean of yield in soy_1, extract the column with $ and pass it straight to mean(); sd() works the same way.

mean(soy_1$yield)

## [1] 1.9312

sd(soy_1$height)

## [1] 0.2116896
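The bracket rules above can be tried on a table small enough to check by eye. The sketch below uses only base R and a made-up data frame called toy (not the soybean file). It also shows one difference from tibbles: a single column taken from a plain data.frame with [ , ] collapses to a vector unless drop = FALSE is given.

```r
# Hypothetical toy data frame (base R only, no packages needed).
toy <- data.frame(
  loc   = c("Lawes", "Brookstead", "Nambour"),
  year  = c(1970, 1970, 1971),
  yield = c(2.39, 1.25, 2.10)
)

first_two   <- toy[1:2, ]                 # rows 1 to 2, every column
two_columns <- toy[, c("loc", "yield")]   # every row, two named columns
yield_vec   <- toy$yield                  # one column as a plain vector

# With a data.frame, drop = FALSE keeps a single column as a one-column table,
# which is what a tibble does by default.
yield_tbl <- toy[, "yield", drop = FALSE]

# A logical condition can stand in for row numbers: mean yield of the 1970 rows.
mean_1970 <- mean(toy[toy$year == 1970, "yield"])
mean_1970
```

Selecting rows with a condition such as toy$year == 1970 is the base R counterpart of the filter() verb introduced in the next section.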

4. Data manipulation: Main verbs - tidy style

If the subsetting code above confused you, you are certainly not the first. Many R users find its quirks frustrating, so a new package called dplyr was written. It is part of the tidyverse and makes subsetting far more convenient by connecting verbs to our data, which makes it much clearer what we are doing to the data. Let's get to know the most important verbs.

4.1 `filter()`

filter() keeps the rows that match the conditions we give it, much like the filter feature in Excel.

Filtering on a numeric column

In this example we keep only the data from 1970, so we filter on the column year and state "equal to" with == (two equals signs; this is a distinct operator from a single =), followed by the number.

soy %>% 
  filter(year == 1970)

## # A tibble: 12 × 10
##    env   loc         year gen   yield height lodging  size protein   oil
##    <chr> <chr>      <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
##  1 L70   Lawes       1970 G01   2.39   1.44     4.25  8.45    36.7  20.9
##  2 L70   Lawes       1970 G02   2.28   1.45     4.25  9.95    37.6  20.7
##  3 L70   Lawes       1970 G03   2.57   1.46     3.75 10.8     37.8  21.3
##  4 B70   Brookstead  1970 G01   1.25   1.01     3.25  8.85    39.5  18.8
##  5 B70   Brookstead  1970 G02   1.17   1.13     2.75  8.9     38.6  19.8
##  6 B70   Brookstead  1970 G03   0.468  1.16     2.25 10.8     37.8  20.4
##  7 N70   Nambour     1970 G01   2.26   0.75     2.25  9.25    34.2  22.3
##  8 N70   Nambour     1970 G02   2.16   0.71     2     9.35    37.6  22.4
##  9 N70   Nambour     1970 G03   2.35   0.685    1.5  11.9     35.5  23.7
## 10 R70   RedlandBay  1970 G01   0.778  0.9      3.25  6.25    40.8  15.9
## 11 R70   RedlandBay  1970 G02   1.09   0.9      3.75  7.35    41.0  17.6
## 12 R70   RedlandBay  1970 G03   1.58   0.835    2.25  8.5     39.8  20.0

Read the code like this: on line 1 we take the data soy and pass it on (%>%) to filter on line 2, which keeps the rows whose year equals 1970.
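One way to see what the pipe does is to compare it with an ordinary function call. The sketch below uses a small made-up tibble named toy in place of soy; the pipe hands its left-hand side to the next function as the first argument, so both forms should give the same table.

```r
library(dplyr)

# Hypothetical toy table standing in for soy.
toy <- tibble(
  loc   = c("Lawes", "Brookstead", "Nambour"),
  year  = c(1970, 1970, 1971),
  yield = c(2.39, 1.25, 2.10)
)

# The pipe passes the left-hand side as the first argument of the next function...
piped <- toy %>% filter(year == 1970)

# ...so this ordinary function call does the same thing.
nested <- filter(toy, year == 1970)

identical(piped, nested)
```

The piped form pays off once several verbs are chained, because the steps then read from top to bottom in the order they are carried out.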

Filtering on a character column

This works just like a numeric column, except that the text to match must be wrapped in quotation marks " ".

soy %>% 
  filter(gen == "G01")

## # A tibble: 8 × 10
##   env   loc         year gen   yield height lodging  size protein   oil
##   <chr> <chr>      <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
## 1 L70   Lawes       1970 G01   2.39   1.44     4.25  8.45    36.7  20.9
## 2 B70   Brookstead  1970 G01   1.25   1.01     3.25  8.85    39.5  18.8
## 3 N70   Nambour     1970 G01   2.26   0.75     2.25  9.25    34.2  22.3
## 4 R70   RedlandBay  1970 G01   0.778  0.9      3.25  6.25    40.8  15.9
## 5 L71   Lawes       1971 G01   2.79   0.97     2.5   8.35    39.4  18.7
## 6 B71   Brookstead  1971 G01   2.56   0.965    3.25  8.9     40.3  18.2
## 7 N71   Nambour     1971 G01   2.10   0.725    1.25  7.3     38.5  19.3
## 8 R71   RedlandBay  1971 G01   1.18   0.97     2.25  6.75    42.1  16.9

Filtering on a character column with several values

When several values are acceptable, for example two locations, == no longer works; use %in% instead. The code below reads as: loc matches either "Lawes" or "Brookstead".

soy %>% 
  filter(loc %in% c("Lawes","Brookstead"))

## # A tibble: 12 × 10
##    env   loc         year gen   yield height lodging  size protein   oil
##    <chr> <chr>      <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
##  1 L70   Lawes       1970 G01   2.39   1.44     4.25  8.45    36.7  20.9
##  2 L70   Lawes       1970 G02   2.28   1.45     4.25  9.95    37.6  20.7
##  3 L70   Lawes       1970 G03   2.57   1.46     3.75 10.8     37.8  21.3
##  4 B70   Brookstead  1970 G01   1.25   1.01     3.25  8.85    39.5  18.8
##  5 B70   Brookstead  1970 G02   1.17   1.13     2.75  8.9     38.6  19.8
##  6 B70   Brookstead  1970 G03   0.468  1.16     2.25 10.8     37.8  20.4
##  7 L71   Lawes       1971 G01   2.79   0.97     2.5   8.35    39.4  18.7
##  8 L71   Lawes       1971 G02   2.62   0.785    2.25  9.75    40.6  19.8
##  9 L71   Lawes       1971 G03   2.48   0.955    2.25 11.4     37.4  20.9
## 10 B71   Brookstead  1971 G01   2.56   0.965    3.25  8.9     40.3  18.2
## 11 B71   Brookstead  1971 G02   2.34   0.835    2.75 10.9     41.8  18.8
## 12 B71   Brookstead  1971 G03   2.87   0.98     3.25 12.4     39.8  19.6
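Why == is the wrong tool for several values can be seen on a plain vector. The sketch below uses base R only, with made-up vectors loc and wanted: == compares position by position and recycles the shorter vector, whereas %in% checks each element against the whole set.

```r
# Hypothetical vector of locations (base R only).
loc <- c("Lawes", "Brookstead", "Nambour", "Lawes")
wanted <- c("Lawes", "Brookstead")

# == compares element by element and silently recycles the shorter vector,
# so the 4th value ("Lawes") is compared with "Brookstead" and is missed.
with_equals <- loc == wanted

# %in% asks, for each element of loc, whether it appears anywhere in wanted.
with_in <- loc %in% wanted

loc[with_equals]
loc[with_in]
```

Inside filter() the same thing happens row by row, which is why == with a vector of values can quietly drop rows that should have been kept.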

Combining conditions

A filter can take more than one condition, joined with the following symbols.

AND: every condition must hold; use &

The command below keeps rows with oil content above 20 and a year that is not 1970. The operator != means "not equal to".

soy %>% 
  filter(oil > 20 & year != 1970)

## # A tibble: 2 × 10
##   env   loc      year gen   yield height lodging  size protein   oil
##   <chr> <chr>   <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
## 1 L71   Lawes    1971 G03    2.48  0.955    2.25 11.4     37.4  20.9
## 2 N71   Nambour  1971 G02    2.6   0.595    1     9.45    41.0  20.7

OR: at least one condition must hold; use | (the vertical bar, found on the ฅ key of a Thai keyboard; it is not a lowercase L or a capital I)

The command below keeps rows with oil content above 20 or a year that is not 1970, so rows with oil of 20 or less that are not from 1970 also come through (they satisfy the second condition).

soy %>% 
  filter(oil > 20 | year != 1970)

## # A tibble: 19 × 10
##    env   loc         year gen   yield height lodging  size protein   oil
##    <chr> <chr>      <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
##  1 L70   Lawes       1970 G01   2.39   1.44     4.25  8.45    36.7  20.9
##  2 L70   Lawes       1970 G02   2.28   1.45     4.25  9.95    37.6  20.7
##  3 L70   Lawes       1970 G03   2.57   1.46     3.75 10.8     37.8  21.3
##  4 B70   Brookstead  1970 G03   0.468  1.16     2.25 10.8     37.8  20.4
##  5 N70   Nambour     1970 G01   2.26   0.75     2.25  9.25    34.2  22.3
##  6 N70   Nambour     1970 G02   2.16   0.71     2     9.35    37.6  22.4
##  7 N70   Nambour     1970 G03   2.35   0.685    1.5  11.9     35.5  23.7
##  8 L71   Lawes       1971 G01   2.79   0.97     2.5   8.35    39.4  18.7
##  9 L71   Lawes       1971 G02   2.62   0.785    2.25  9.75    40.6  19.8
## 10 L71   Lawes       1971 G03   2.48   0.955    2.25 11.4     37.4  20.9
## 11 B71   Brookstead  1971 G01   2.56   0.965    3.25  8.9     40.3  18.2
## 12 B71   Brookstead  1971 G02   2.34   0.835    2.75 10.9     41.8  18.8
## 13 B71   Brookstead  1971 G03   2.87   0.98     3.25 12.4     39.8  19.6
## 14 N71   Nambour     1971 G01   2.10   0.725    1.25  7.3     38.5  19.3
## 15 N71   Nambour     1971 G02   2.6    0.595    1     9.45    41.0  20.7
## 16 N71   Nambour     1971 G03   2.53   0.61     1    11.3     41.6  19.9
## 17 R71   RedlandBay  1971 G01   1.18   0.97     2.25  6.75    42.1  16.9
## 18 R71   RedlandBay  1971 G02   1.90   0.74     1.75  7.95    42.7  18.9
## 19 R71   RedlandBay  1971 G03   2.02   0.835    1.5   9.1     39.8  19.7
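The difference between & and | is easier to verify on a handful of rows. The sketch below uses a made-up four-row tibble named toy. It also relies on a documented dplyr behavior: conditions separated by commas inside filter() are combined with AND.

```r
library(dplyr)

# Hypothetical toy table with just the columns needed for the conditions.
toy <- tibble(
  gen  = c("G01", "G02", "G03", "G01"),
  year = c(1970, 1970, 1971, 1971),
  oil  = c(20.9, 18.8, 20.7, 18.7)
)

# AND: both conditions must hold (only the 1971 row with oil 20.7).
with_and <- toy %>% filter(oil > 20 & year != 1970)

# Separating conditions with a comma is another way to write AND.
with_comma <- toy %>% filter(oil > 20, year != 1970)

# OR: one condition is enough (every row except the 1970 row with oil 18.8).
with_or <- toy %>% filter(oil > 20 | year != 1970)

nrow(with_and)
nrow(with_or)
```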

4.2 `arrange()`

The second verb is arrange(), which sorts the data like Sort in Excel. By default it sorts in ascending order, as in the example below, which sorts by yield from lowest to highest.

soy %>% 
  arrange(yield)

## # A tibble: 24 × 10
##    env   loc         year gen   yield height lodging  size protein   oil
##    <chr> <chr>      <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
##  1 B70   Brookstead  1970 G03   0.468  1.16     2.25 10.8     37.8  20.4
##  2 R70   RedlandBay  1970 G01   0.778  0.9      3.25  6.25    40.8  15.9
##  3 R70   RedlandBay  1970 G02   1.09   0.9      3.75  7.35    41.0  17.6
##  4 B70   Brookstead  1970 G02   1.17   1.13     2.75  8.9     38.6  19.8
##  5 R71   RedlandBay  1971 G01   1.18   0.97     2.25  6.75    42.1  16.9
##  6 B70   Brookstead  1970 G01   1.25   1.01     3.25  8.85    39.5  18.8
##  7 R70   RedlandBay  1970 G03   1.58   0.835    2.25  8.5     39.8  20.0
##  8 R71   RedlandBay  1971 G02   1.90   0.74     1.75  7.95    42.7  18.9
##  9 R71   RedlandBay  1971 G03   2.02   0.835    1.5   9.1     39.8  19.7
## 10 N71   Nambour     1971 G01   2.10   0.725    1.25  7.3     38.5  19.3
## # … with 14 more rows

To sort from highest to lowest, wrap the column name in desc().

soy %>% 
  arrange(desc(yield))

## # A tibble: 24 × 10
##    env   loc         year gen   yield height lodging  size protein   oil
##    <chr> <chr>      <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
##  1 B71   Brookstead  1971 G03    2.87  0.98     3.25 12.4     39.8  19.6
##  2 L71   Lawes       1971 G01    2.79  0.97     2.5   8.35    39.4  18.7
##  3 L71   Lawes       1971 G02    2.62  0.785    2.25  9.75    40.6  19.8
##  4 N71   Nambour     1971 G02    2.6   0.595    1     9.45    41.0  20.7
##  5 L70   Lawes       1970 G03    2.57  1.46     3.75 10.8     37.8  21.3
##  6 B71   Brookstead  1971 G01    2.56  0.965    3.25  8.9     40.3  18.2
##  7 N71   Nambour     1971 G03    2.53  0.61     1    11.3     41.6  19.9
##  8 L71   Lawes       1971 G03    2.48  0.955    2.25 11.4     37.4  20.9
##  9 L70   Lawes       1970 G01    2.39  1.44     4.25  8.45    36.7  20.9
## 10 N70   Nambour     1970 G03    2.35  0.685    1.5  11.9     35.5  23.7
## # … with 14 more rows

We can sort by several columns at once, for example by location first and then by genotype.

soy %>% 
  arrange(loc, gen)

## # A tibble: 24 × 10
##    env   loc         year gen   yield height lodging  size protein   oil
##    <chr> <chr>      <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
##  1 B70   Brookstead  1970 G01   1.25   1.01     3.25  8.85    39.5  18.8
##  2 B71   Brookstead  1971 G01   2.56   0.965    3.25  8.9     40.3  18.2
##  3 B70   Brookstead  1970 G02   1.17   1.13     2.75  8.9     38.6  19.8
##  4 B71   Brookstead  1971 G02   2.34   0.835    2.75 10.9     41.8  18.8
##  5 B70   Brookstead  1970 G03   0.468  1.16     2.25 10.8     37.8  20.4
##  6 B71   Brookstead  1971 G03   2.87   0.98     3.25 12.4     39.8  19.6
##  7 L70   Lawes       1970 G01   2.39   1.44     4.25  8.45    36.7  20.9
##  8 L71   Lawes       1971 G01   2.79   0.97     2.5   8.35    39.4  18.7
##  9 L70   Lawes       1970 G02   2.28   1.45     4.25  9.95    37.6  20.7
## 10 L71   Lawes       1971 G02   2.62   0.785    2.25  9.75    40.6  19.8
## # … with 14 more rows

What if we swap the two? Try it and see.

soy %>% 
  arrange(gen, loc)

## # A tibble: 24 × 10
##    env   loc         year gen   yield height lodging  size protein   oil
##    <chr> <chr>      <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
##  1 B70   Brookstead  1970 G01   1.25   1.01     3.25  8.85    39.5  18.8
##  2 B71   Brookstead  1971 G01   2.56   0.965    3.25  8.9     40.3  18.2
##  3 L70   Lawes       1970 G01   2.39   1.44     4.25  8.45    36.7  20.9
##  4 L71   Lawes       1971 G01   2.79   0.97     2.5   8.35    39.4  18.7
##  5 N70   Nambour     1970 G01   2.26   0.75     2.25  9.25    34.2  22.3
##  6 N71   Nambour     1971 G01   2.10   0.725    1.25  7.3     38.5  19.3
##  7 R70   RedlandBay  1970 G01   0.778  0.9      3.25  6.25    40.8  15.9
##  8 R71   RedlandBay  1971 G01   1.18   0.97     2.25  6.75    42.1  16.9
##  9 B70   Brookstead  1970 G02   1.17   1.13     2.75  8.9     38.6  19.8
## 10 B71   Brookstead  1971 G02   2.34   0.835    2.75 10.9     41.8  18.8
## # … with 14 more rows

The second version sorts by gen first, and then by location within each genotype.
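Ascending and descending order can also be mixed in one call. The sketch below, on a made-up tibble named toy, sorts locations alphabetically and, within each location, puts the highest yield first.

```r
library(dplyr)

# Hypothetical toy table: two locations, two yields each.
toy <- tibble(
  loc   = c("Lawes", "Brookstead", "Lawes", "Brookstead"),
  yield = c(2.39, 1.25, 2.79, 2.56)
)

# loc ascending (A to Z), then yield descending within each loc.
sorted <- toy %>%
  arrange(loc, desc(yield))

sorted
```

The first column listed always takes priority; later columns only break ties among rows that share the same value in the earlier ones.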

4.3 `slice()`

slice() is similar to filter(), but it selects rows by position (row number).

soy %>% 
  slice(1:5)

## # A tibble: 5 × 10
##   env   loc         year gen   yield height lodging  size protein   oil
##   <chr> <chr>      <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
## 1 L70   Lawes       1970 G01    2.39   1.44    4.25  8.45    36.7  20.9
## 2 L70   Lawes       1970 G02    2.28   1.45    4.25  9.95    37.6  20.7
## 3 L70   Lawes       1970 G03    2.57   1.46    3.75 10.8     37.8  21.3
## 4 B70   Brookstead  1970 G01    1.25   1.01    3.25  8.85    39.5  18.8
## 5 B70   Brookstead  1970 G02    1.17   1.13    2.75  8.9     38.6  19.8

Two very useful relatives are slice_max() and slice_min(), which pick the rows with the largest or smallest values in a column. For example, the command below selects the 3 rows with the highest yield (n = 3).

soy %>% 
  slice_max(yield, n = 3)

## # A tibble: 3 × 10
##   env   loc         year gen   yield height lodging  size protein   oil
##   <chr> <chr>      <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
## 1 B71   Brookstead  1971 G03    2.87  0.98     3.25 12.4     39.8  19.6
## 2 L71   Lawes       1971 G01    2.79  0.97     2.5   8.35    39.4  18.7
## 3 L71   Lawes       1971 G02    2.62  0.785    2.25  9.75    40.6  19.8
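The counterpart slice_min() works the same way. The sketch below, on a made-up tibble named toy with no tied values, picks the two lowest yields and compares that with sorting first and then slicing by position.

```r
library(dplyr)

# Hypothetical toy table with four distinct yields.
toy <- tibble(
  gen   = c("G01", "G02", "G03", "G04"),
  yield = c(2.39, 1.25, 2.10, 0.78)
)

# The 2 rows with the smallest yield.
lowest <- toy %>% slice_min(yield, n = 2)

# The same rows obtained by sorting and then taking positions 1 to 2.
by_sorting <- toy %>% arrange(yield) %>% slice(1:2)

lowest
by_sorting
```

When values are tied, slice_min() and slice_max() keep all tied rows by default (with_ties = TRUE), so they can return more than n rows, while the arrange-then-slice version always returns exactly the positions asked for.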

4.4 `select()`

The verb select() picks columns by name (unlike filter(), which picks rows). Below we keep only the three columns env, loc and gen.

soy %>% 
  select(env, loc, gen)

## # A tibble: 24 × 3
##    env   loc        gen  
##    <chr> <chr>      <chr>
##  1 L70   Lawes      G01  
##  2 L70   Lawes      G02  
##  3 L70   Lawes      G03  
##  4 B70   Brookstead G01  
##  5 B70   Brookstead G02  
##  6 B70   Brookstead G03  
##  7 N70   Nambour    G01  
##  8 N70   Nambour    G02  
##  9 N70   Nambour    G03  
## 10 R70   RedlandBay G01  
## # … with 14 more rows

To drop columns instead, put a - in front of their names and those columns disappear.

soy %>% 
  select( -env, -loc)

## # A tibble: 24 × 8
##     year gen   yield height lodging  size protein   oil
##    <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
##  1  1970 G01   2.39   1.44     4.25  8.45    36.7  20.9
##  2  1970 G02   2.28   1.45     4.25  9.95    37.6  20.7
##  3  1970 G03   2.57   1.46     3.75 10.8     37.8  21.3
##  4  1970 G01   1.25   1.01     3.25  8.85    39.5  18.8
##  5  1970 G02   1.17   1.13     2.75  8.9     38.6  19.8
##  6  1970 G03   0.468  1.16     2.25 10.8     37.8  20.4
##  7  1970 G01   2.26   0.75     2.25  9.25    34.2  22.3
##  8  1970 G02   2.16   0.71     2     9.35    37.6  22.4
##  9  1970 G03   2.35   0.685    1.5  11.9     35.5  23.7
## 10  1970 G01   0.778  0.9      3.25  6.25    40.8  15.9
## # … with 14 more rows

Advanced (beginners can skip this): select() can also pick columns by condition. For example,

to keep only the numeric columns:

soy %>% 
  select(where(is.numeric))

## # A tibble: 24 × 7
##     year yield height lodging  size protein   oil
##    <dbl> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
##  1  1970 2.39   1.44     4.25  8.45    36.7  20.9
##  2  1970 2.28   1.45     4.25  9.95    37.6  20.7
##  3  1970 2.57   1.46     3.75 10.8     37.8  21.3
##  4  1970 1.25   1.01     3.25  8.85    39.5  18.8
##  5  1970 1.17   1.13     2.75  8.9     38.6  19.8
##  6  1970 0.468  1.16     2.25 10.8     37.8  20.4
##  7  1970 2.26   0.75     2.25  9.25    34.2  22.3
##  8  1970 2.16   0.71     2     9.35    37.6  22.4
##  9  1970 2.35   0.685    1.5  11.9     35.5  23.7
## 10  1970 0.778  0.9      3.25  6.25    40.8  15.9
## # … with 14 more rows

or to keep only the columns whose names start with "y":

soy %>% 
  select(starts_with("y"))

## # A tibble: 24 × 2
##     year yield
##    <dbl> <dbl>
##  1  1970 2.39 
##  2  1970 2.28 
##  3  1970 2.57 
##  4  1970 1.25 
##  5  1970 1.17 
##  6  1970 0.468
##  7  1970 2.26 
##  8  1970 2.16 
##  9  1970 2.35 
## 10  1970 0.778
## # … with 14 more rows

4.5 `mutate()`

The word mutate sounds like it is about mutation, but in dplyr it creates a new column. For example, to create a new column height_cm from the existing column height (in meters) by multiplying it by 100 to get centimeters (* is the multiplication sign in R), write:

soy %>% 
  mutate(height_cm = height*100)

## # A tibble: 24 × 11
##    env   loc       year gen   yield height lodging  size protein   oil height_cm
##    <chr> <chr>    <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>     <dbl>
##  1 L70   Lawes     1970 G01   2.39   1.44     4.25  8.45    36.7  20.9     144. 
##  2 L70   Lawes     1970 G02   2.28   1.45     4.25  9.95    37.6  20.7     145  
##  3 L70   Lawes     1970 G03   2.57   1.46     3.75 10.8     37.8  21.3     146  
##  4 B70   Brookst…  1970 G01   1.25   1.01     3.25  8.85    39.5  18.8     101. 
##  5 B70   Brookst…  1970 G02   1.17   1.13     2.75  8.9     38.6  19.8     113  
##  6 B70   Brookst…  1970 G03   0.468  1.16     2.25 10.8     37.8  20.4     116. 
##  7 N70   Nambour   1970 G01   2.26   0.75     2.25  9.25    34.2  22.3      75  
##  8 N70   Nambour   1970 G02   2.16   0.71     2     9.35    37.6  22.4      71  
##  9 N70   Nambour   1970 G03   2.35   0.685    1.5  11.9     35.5  23.7      68.5
## 10 R70   Redland…  1970 G01   0.778  0.9      3.25  6.25    40.8  15.9      90  
## # … with 14 more rows

mutate() can create several new columns at once, separated by commas:

soy %>% 
  mutate(height_cm = height*100, 
         PO_ratio = protein/oil)

## # A tibble: 24 × 12
##    env   loc       year gen   yield height lodging  size protein   oil height_cm
##    <chr> <chr>    <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>     <dbl>
##  1 L70   Lawes     1970 G01   2.39   1.44     4.25  8.45    36.7  20.9     144. 
##  2 L70   Lawes     1970 G02   2.28   1.45     4.25  9.95    37.6  20.7     145  
##  3 L70   Lawes     1970 G03   2.57   1.46     3.75 10.8     37.8  21.3     146  
##  4 B70   Brookst…  1970 G01   1.25   1.01     3.25  8.85    39.5  18.8     101. 
##  5 B70   Brookst…  1970 G02   1.17   1.13     2.75  8.9     38.6  19.8     113  
##  6 B70   Brookst…  1970 G03   0.468  1.16     2.25 10.8     37.8  20.4     116. 
##  7 N70   Nambour   1970 G01   2.26   0.75     2.25  9.25    34.2  22.3      75  
##  8 N70   Nambour   1970 G02   2.16   0.71     2     9.35    37.6  22.4      71  
##  9 N70   Nambour   1970 G03   2.35   0.685    1.5  11.9     35.5  23.7      68.5
## 10 R70   Redland…  1970 G01   0.778  0.9      3.25  6.25    40.8  15.9      90  
## # … with 14 more rows, and 1 more variable: PO_ratio <dbl>
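Columns created in a mutate() call are available to later expressions in the same call. The sketch below, on a made-up two-row tibble named toy, builds height_cm and then immediately uses it to build a hypothetical height_mm column.

```r
library(dplyr)

# Hypothetical toy table with heights in meters.
toy <- tibble(
  gen    = c("G01", "G02"),
  height = c(1.44, 1.01)
)

# height_mm is computed from height_cm, which was created one line earlier.
result <- toy %>%
  mutate(height_cm = height * 100,
         height_mm = height_cm * 10)

result
```

Like the other verbs, mutate() returns a new table and leaves toy itself unchanged unless the result is assigned back.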

4.6 `summarize()`

summarize() builds a new table that summarizes the raw data. For example, the code below computes the mean height over the whole table.

soy %>% 
  summarize(mean_height = mean(height))

## # A tibble: 1 × 1
##   mean_height
##         <dbl>
## 1       0.933

Advanced (beginners can skip this): summarize() can work on columns chosen by a condition, for example the mean of every numeric column. (summarise() is the same function with the British spelling.)

soy %>% 
  summarise(across(where(is.numeric), mean))

## # A tibble: 1 × 7
##    year yield height lodging  size protein   oil
##   <dbl> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
## 1 1970.  2.01  0.933    2.52  9.33    39.2  19.8

Most often summarize() is used to summarize the data by another variable, together with group_by(). In the example below we summarize by year, computing the mean and standard deviation (SD) of oil content.

soy %>% 
  group_by(year) %>% 
  summarize(mean_oil = mean(oil),
            sd_oil = sd(oil))

## # A tibble: 2 × 3
##    year mean_oil sd_oil
##   <dbl>    <dbl>  <dbl>
## 1  1970     20.3   2.14
## 2  1971     19.3   1.10

Grouping can use more than one column, for example year and location.

soy %>% 
  group_by(year, loc) %>% 
  summarize(mean_oil = mean(oil),
            sd_oil = sd(oil))

## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.

## # A tibble: 8 × 4
## # Groups:   year [2]
##    year loc        mean_oil sd_oil
##   <dbl> <chr>         <dbl>  <dbl>
## 1  1970 Brookstead     19.7  0.764
## 2  1970 Lawes          21.0  0.286
## 3  1970 Nambour        22.8  0.748
## 4  1970 RedlandBay     17.8  2.04 
## 5  1971 Brookstead     18.9  0.673
## 6  1971 Lawes          19.8  1.13 
## 7  1971 Nambour        20.0  0.690
## 8  1971 RedlandBay     18.5  1.45
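The message printed above mentions the .groups argument. As a minimal sketch on a made-up tibble named toy, passing .groups = "drop" returns an ungrouped table and silences the message; the other values of .groups are not covered here.

```r
library(dplyr)

# Hypothetical toy table: two years, two locations.
toy <- tibble(
  year = c(1970, 1970, 1971, 1971),
  loc  = c("Lawes", "Lawes", "Lawes", "Nambour"),
  oil  = c(20.9, 20.7, 18.7, 19.3)
)

# One row per year-location combination; the result carries no grouping.
by_year_loc <- toy %>%
  group_by(year, loc) %>%
  summarize(mean_oil = mean(oil), .groups = "drop")

by_year_loc
is_grouped_df(by_year_loc)
```

Dropping the grouping is a cautious choice when the summary will be processed further, since leftover groups change how later verbs such as mutate() or slice() behave.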

5. Putting together multiple verbs

Thanks to the power of the pipe, %>%, several verbs can be chained together. The example below carries out these steps:

Take the data soy, then...
filter to keep only 1970, then...
group (group_by) by location (loc), in order to...
summarize the mean and SD (sd) of oil content.

soy %>% 
  filter(year == 1970) %>% 
  group_by(loc) %>% 
  summarize(mean_oil = mean(oil), sd_oil = sd(oil))

## # A tibble: 4 × 3
##   loc        mean_oil sd_oil
##   <chr>         <dbl>  <dbl>
## 1 Brookstead     19.7  0.764
## 2 Lawes          21.0  0.286
## 3 Nambour        22.8  0.748
## 4 RedlandBay     17.8  2.04

So far everything has only been printed. To use the summarized or processed data for something else, assign it (<-) to a new object.

soy_summary <- soy %>% 
  filter(year == 1970) %>% 
  group_by(loc) %>% 
  summarize(mean.oil = mean(oil), sd.oil = sd(oil))

This is what the result looks like:

soy_summary

## # A tibble: 4 × 3
##   loc        mean.oil sd.oil
##   <chr>         <dbl>  <dbl>
## 1 Brookstead     19.7  0.764
## 2 Lawes          21.0  0.286
## 3 Nambour        22.8  0.748
## 4 RedlandBay     17.8  2.04

6. \[Advanced\] Additional verbs for manipulation

`pivot_wider()` and `pivot_longer()`

Our data (soy) are currently in what is called wide form, meaning each row holds several attributes at once. Sometimes this layout is awkward to work with, and we want to reshape the data into what is called long form.

This can be done with pivot_longer():

soy_long <- soy %>% 
  pivot_longer(cols = yield:oil, names_to = "measurement", values_to = "value")

soy_long

## # A tibble: 144 × 6
##    env   loc    year gen   measurement value
##    <chr> <chr> <dbl> <chr> <chr>       <dbl>
##  1 L70   Lawes  1970 G01   yield        2.39
##  2 L70   Lawes  1970 G01   height       1.44
##  3 L70   Lawes  1970 G01   lodging      4.25
##  4 L70   Lawes  1970 G01   size         8.45
##  5 L70   Lawes  1970 G01   protein     36.7 
##  6 L70   Lawes  1970 G01   oil         20.9 
##  7 L70   Lawes  1970 G02   yield        2.28
##  8 L70   Lawes  1970 G02   height       1.45
##  9 L70   Lawes  1970 G02   lodging      4.25
## 10 L70   Lawes  1970 G02   size         9.95
## # … with 134 more rows

In long form the data are easier to summarize: we can group_by measurement, loc and year all at once.

soy_summary <- soy_long %>% 
  group_by(measurement, loc, year) %>% 
  summarize(mean = mean(value), sd = sd(value))

## `summarise()` has grouped output by 'measurement', 'loc'. You can override using the `.groups` argument.

soy_summary

## # A tibble: 48 × 5
## # Groups:   measurement, loc [24]
##    measurement loc         year  mean      sd
##    <chr>       <chr>      <dbl> <dbl>   <dbl>
##  1 height      Brookstead  1970 1.1   0.0747 
##  2 height      Brookstead  1971 0.927 0.0797 
##  3 height      Lawes       1970 1.45  0.00764
##  4 height      Lawes       1971 0.903 0.103  
##  5 height      Nambour     1970 0.715 0.0328 
##  6 height      Nambour     1971 0.643 0.0711 
##  7 height      RedlandBay  1970 0.878 0.0375 
##  8 height      RedlandBay  1971 0.848 0.116  
##  9 lodging     Brookstead  1970 2.75  0.5    
## 10 lodging     Brookstead  1971 3.08  0.289  
## # … with 38 more rows

This makes it quick to draw a tidy summary plot (more details in the next lesson).

soy_summary %>% 
  filter(measurement %in% c("oil", "protein")) %>% 
  ggplot(aes(x = loc, y = mean, fill = loc)) +
  geom_bar(stat = "identity") +
  facet_grid(year ~ measurement)

To turn the data back into wide form, use pivot_wider(). Here we only want the means in the wide table, so the sd column is removed with select() first.

soy_summary_wide <- soy_summary %>%
  select(-sd) %>% 
  pivot_wider(names_from = "measurement", values_from = "mean")

soy_summary_wide

## # A tibble: 8 × 8
## # Groups:   loc [4]
##   loc         year height lodging   oil protein  size yield
##   <chr>      <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl> <dbl>
## 1 Brookstead  1970  1.1      2.75  19.7    38.6  9.52 0.963
## 2 Brookstead  1971  0.927    3.08  18.9    40.6 10.7  2.59 
## 3 Lawes       1970  1.45     4.08  21.0    37.4  9.75 2.41 
## 4 Lawes       1971  0.903    2.33  19.8    39.1  9.83 2.63 
## 5 Nambour     1970  0.715    1.92  22.8    35.8 10.2  2.26 
## 6 Nambour     1971  0.643    1.08  20.0    40.4  9.35 2.41 
## 7 RedlandBay  1970  0.878    3.08  17.8    40.5  7.37 1.15 
## 8 RedlandBay  1971  0.848    1.83  18.5    41.6  7.93 1.70
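The two pivot functions undo each other, which is easiest to see on a tiny table. The sketch below uses a made-up two-row tibble named toy: it goes from wide to long and back, and the round trip should recover the original columns.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one row per genotype, one column per measurement.
toy <- tibble(
  gen   = c("G01", "G02"),
  yield = c(2.39, 2.28),
  oil   = c(20.9, 20.7)
)

# Wide to long: 2 rows x 2 measurements become 4 rows.
toy_long <- toy %>%
  pivot_longer(cols = yield:oil, names_to = "measurement", values_to = "value")

# Long back to wide: one column per distinct value of measurement.
toy_wide <- toy_long %>%
  pivot_wider(names_from = measurement, values_from = value)

toy_long
toy_wide
```

The row count of the long table is the number of original rows times the number of pivoted columns, which matches the 24 × 6 = 144 rows of soy_long above.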

7. Write out CSV file.

Once the work on the data is finished, the result can be saved as a CSV file, which also opens in Excel.

write_csv(soy_summary_wide, file = "aus_soy_summary.csv")
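A quick way to check that a file was written as intended is to read it back. The sketch below writes a made-up two-row table to a temporary file (so nothing in the working directory is touched) and reads it again with read_csv; the names toy, out_path and read_back are illustrative only.

```r
library(readr)

# Hypothetical summary table to save.
toy <- data.frame(
  loc      = c("Lawes", "Nambour"),
  mean_oil = c(21.0, 22.8)
)

# tempfile() gives a throwaway path; in real use, supply your own file name.
out_path <- tempfile(fileext = ".csv")

write_csv(toy, file = out_path)

# Read the file back to confirm that the contents survived the round trip.
read_back <- read_csv(out_path, show_col_types = FALSE)
read_back
```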

Ekaphan Kraichak
