Input and merge files: Potthoff-Roy Data

Data

Potthoff and Roy (1964) reported data from a study of 16 boys and 11 girls who, at ages 8, 10, 12, and 14, had the distance (mm) from the center of the pituitary gland to the pterygomaxillary fissure measured. Changes in the pituitary-pterygomaxillary distance during growth are important in orthodontic therapy. We consider only the girls' data here.

# require(pacman)
pacman::p_load(mice)
data(potthoffroy)
subset(potthoffroy, sex=='F')

   id sex   d8  d10  d12  d14
1   1   F 21.0 20.0 21.5 23.0
2   2   F 21.0 21.5 24.0 25.5
3   3   F 20.5 24.0 24.5 26.0
4   4   F 23.5 24.5 25.0 26.5
5   5   F 21.5 23.0 22.5 23.5
6   6   F 20.0 21.0 21.0 22.5
7   7   F 21.5 22.5 23.0 25.0
8   8   F 23.0 23.0 23.5 24.0
9   9   F 20.0 21.0 22.0 21.5
10 10   F 16.5 19.0 19.0 19.5
11 11   F 24.5 25.0 28.0 28.0

dir.create(file.path(getwd(), "./tmp_data"), showWarnings=FALSE)

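dir.create creates a folder named tmp_data inside the current working directory returned by getwd(). showWarnings=FALSE suppresses the warning that would otherwise appear if the folder already exists.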

lapply(list, func) applies a function to each element of a list. Here it is applied to columns 3 to 6 of potthoffroy, splitting the data into four csv files.

The function subsets potthoffroy to the female data and writes one file per measurement column, each holding the id column and one age column: (column 1, column 3), (column 1, column 4), and so on.

paste0 builds the output path and file name for each csv, starting with the prefix "./tmp_data/f_". Because i runs from 3 to 6, i-2 is used to number the files from 1 to 4.
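The code that writes the four files is not shown above. A minimal sketch consistent with the description might look like the following. It is a reconstruction, not necessarily the original code. The small girls data frame is a stand-in for subset(potthoffroy, sex=='F') with only the first three rows of the table typed in, the folder is placed under tempdir() rather than the working directory, and row.names = FALSE is an assumption based on the merged output later showing only id and the measurement columns.

```r
# Toy stand-in for the female subset of potthoffroy (first three rows of the
# table above). With mice loaded, subset(potthoffroy, sex == "F") would be used.
girls <- data.frame(
  id  = 1:3,
  sex = c("F", "F", "F"),
  d8  = c(21.0, 21.0, 20.5),
  d10 = c(20.0, 21.5, 24.0),
  d12 = c(21.5, 24.0, 24.5),
  d14 = c(23.0, 25.5, 26.0)
)

# Hypothetical output folder under tempdir(); the text uses "./tmp_data".
out_dir <- file.path(tempdir(), "tmp_data")
dir.create(out_dir, showWarnings = FALSE)

# Columns 3 to 6 hold d8, d10, d12, d14. Each file gets id plus one of them,
# and i - 2 turns the column index 3..6 into the file number 1..4.
written <- lapply(3:6, function(i) {
  out_file <- file.path(out_dir, paste0("f_", i - 2, ".csv"))
  write.csv(girls[, c(1, i)], file = out_file, row.names = FALSE)
  out_file
})

basename(unlist(written))
```

Returning the path from the function is only a convenience for checking which files were written; write.csv itself returns nothing useful.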

Files in a folder

With the previous code, we have created 4 separate files in a local folder called tmp_data.

list.files("./tmp_data/", pattern="f_")

[1] "f_1.csv" "f_2.csv" "f_3.csv" "f_4.csv"

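list.files returns the names of all files in tmp_data that match the pattern "f_". The pattern is a regular expression matched anywhere in the file name; here every match is a file whose name starts with f_.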
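Because the pattern argument is a regular expression, "f_" can also match names that merely contain those characters. The sketch below illustrates the difference with an anchored pattern. The folder and file names are made up for the demonstration and are not part of the original data.

```r
# Hypothetical demo folder with two data files and one unrelated file
demo_dir <- file.path(tempdir(), "pattern_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("f_1.csv", "f_2.csv", "pdf_notes.txt")))

# Unanchored: also picks up "pdf_notes.txt", which contains "f_"
loose <- list.files(demo_dir, pattern = "f_")

# Anchored: name must start with f_ and end with .csv
strict <- list.files(demo_dir, pattern = "^f_.*\\.csv$")

loose
strict
```

In the tmp_data folder of this example both patterns give the same result, so the anchored form only matters when other files share the folder.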

The content of the first one looks like this.

read.csv("./tmp_data/f_1.csv")

Now collect the file names.

fls <- list.files(path = "./tmp_data", pattern = "f_")
fls

[1] "f_1.csv" "f_2.csv" "f_3.csv" "f_4.csv"

Remember to prepend the folder path to each file name so that R can locate the files from the working directory.

fL <- paste0("./tmp_data/", fls)
fL

[1] "./tmp_data/f_1.csv" "./tmp_data/f_2.csv" "./tmp_data/f_3.csv"
[4] "./tmp_data/f_4.csv"
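As an alternative to pasting the folder name by hand, list.files has a full.names argument that returns paths instead of bare file names. The sketch below uses a made-up folder under tempdir() to show both routes side by side.

```r
# Hypothetical demo folder with two empty files
demo_dir <- file.path(tempdir(), "fullnames_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("f_1.csv", "f_2.csv")))

# Route 1: list bare names, then prepend the folder
by_paste <- file.path(demo_dir, list.files(demo_dir, pattern = "f_"))

# Route 2: let list.files prepend the folder itself
by_flag <- list.files(demo_dir, pattern = "f_", full.names = TRUE)

# Both sets of paths point at existing files
all(file.exists(by_paste))
all(file.exists(by_flag))
```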

Input multiple files

Read these files in as a list of data frames.

ff <- lapply(fL, read.csv)
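After reading, it can help to check that every element of the list is a data frame with the expected columns. The sketch below does this on two small files written to a made-up folder under tempdir(); the values are the first three rows of d8 and d10 from the table above. Naming the list elements by file name is optional and is an addition for readability.

```r
# Hypothetical demo folder holding two small csv files
demo_dir <- file.path(tempdir(), "read_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:3, d8 = c(21.0, 21.0, 20.5)),
          file.path(demo_dir, "f_1.csv"), row.names = FALSE)
write.csv(data.frame(id = 1:3, d10 = c(20.0, 21.5, 24.0)),
          file.path(demo_dir, "f_2.csv"), row.names = FALSE)

# Read every file into one list of data frames
paths <- list.files(demo_dir, pattern = "^f_", full.names = TRUE)
frames <- lapply(paths, read.csv)

# Label each element with the file it came from
names(frames) <- basename(paths)

# Quick checks: class and second column name of each element
sapply(frames, is.data.frame)
sapply(frames, function(x) names(x)[2])
```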

Merge

We can merge two files by id.

# use [[ ]] to extract the data frames themselves, not one-element lists
merge(ff[[1]], ff[[2]], by = "id")

   id   d8  d10
1   1 21.0 20.0
2   2 21.0 21.5
3   3 20.5 24.0
4   4 23.5 24.5
5   5 21.5 23.0
6   6 20.0 21.0
7   7 21.5 22.5
8   8 23.0 23.0
9   9 20.0 21.0
10 10 16.5 19.0
11 11 24.5 25.0
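When picking elements out of a list for merging, single and double brackets behave differently: single brackets return a shorter list, double brackets return the element itself. A small sketch with two toy data frames (first three rows of d8 and d10 from the table above):

```r
frames <- list(
  data.frame(id = 1:3, d8  = c(21.0, 21.0, 20.5)),
  data.frame(id = 1:3, d10 = c(20.0, 21.5, 24.0))
)

class(frames[1])    # "list": still a list, of length one
class(frames[[1]])  # "data.frame": the element itself

# Merge the two data frames on their shared id column
merged <- merge(frames[[1]], frames[[2]], by = "id")
names(merged)
```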

Reduce approach

The function Reduce allows us to ‘loop’ through the list of files with our own version of merge called mrg2.

# Roll our own merging function
mrg2 <- function(f1, f2){                                
  merge(f1, f2, by="id")
}
Reduce(mrg2, ff)

   id   d8  d10  d12  d14
1   1 21.0 20.0 21.5 23.0
2   2 21.0 21.5 24.0 25.5
3   3 20.5 24.0 24.5 26.0
4   4 23.5 24.5 25.0 26.5
5   5 21.5 23.0 22.5 23.5
6   6 20.0 21.0 21.0 22.5
7   7 21.5 22.5 23.0 25.0
8   8 23.0 23.0 23.5 24.0
9   9 20.0 21.0 22.0 21.5
10 10 16.5 19.0 19.0 19.5
11 11 24.5 25.0 28.0 28.0
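To see what Reduce does here, it can be written out by hand: the first two data frames are merged, the result is merged with the third, and so on. The sketch below checks this on three toy data frames (first three rows of d8, d10, and d12 from the table above); the names part8, part10, and part12 are made up for the illustration.

```r
part8  <- data.frame(id = 1:3, d8  = c(21.0, 21.0, 20.5))
part10 <- data.frame(id = 1:3, d10 = c(20.0, 21.5, 24.0))
part12 <- data.frame(id = 1:3, d12 = c(21.5, 24.0, 24.5))

mrg2 <- function(f1, f2) {
  merge(f1, f2, by = "id")
}

# Nested calls written out explicitly
by_hand <- mrg2(mrg2(part8, part10), part12)

# The same fold expressed with Reduce
by_reduce <- Reduce(mrg2, list(part8, part10, part12))

identical(by_hand, by_reduce)
```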

Tidy approach

Instead of ‘merge’, ‘inner_join’ is used; instead of ‘Reduce’, ‘reduce’.

library(tidyverse)
ff %>% reduce(inner_join, by='id')

   id   d8  d10  d12  d14
1   1 21.0 20.0 21.5 23.0
2   2 21.0 21.5 24.0 25.5
3   3 20.5 24.0 24.5 26.0
4   4 23.5 24.5 25.0 26.5
5   5 21.5 23.0 22.5 23.5
6   6 20.0 21.0 21.0 22.5
7   7 21.5 22.5 23.0 25.0
8   8 23.0 23.0 23.5 24.0
9   9 20.0 21.0 22.0 21.5
10 10 16.5 19.0 19.0 19.5
11 11 24.5 25.0 28.0 28.0

Vertical direction

We can ‘bind’ the input in the vertical direction to construct an output in the long format in contrast to the wide format of the original data.

# extract number of observations from components of the list
n <- sapply(ff, function(x) dim(x)[1])

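sapply works much like lapply but simplifies the result to a vector where possible. Applied to the list ff, dim(x) gives the number of rows and columns of each element: dim(x)[1] is the number of rows and dim(x)[2] is the number of columns.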
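The difference between lapply and sapply can be seen on a small list. The sketch below uses two toy data frames with three rows each, so the counts differ from the 11 rows per file in the real data.

```r
frames <- list(
  data.frame(id = 1:3, d8  = c(21.0, 21.0, 20.5)),
  data.frame(id = 1:3, d10 = c(20.0, 21.5, 24.0))
)

# lapply always returns a list, one element per input
n_list <- lapply(frames, function(x) dim(x)[1])

# sapply simplifies the same result to a plain vector
n <- sapply(frames, function(x) dim(x)[1])

is.list(n_list)
n
```

nrow(x) is an equivalent and slightly more readable way to write dim(x)[1].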

# extract numbers from year variable
p <- sapply(ff, function(x) names(x)[2]) %>% parse_number()

For each element of the list ff, names(x)[2] extracts the name of the second column, and parse_number then extracts the number contained in that name (for example, d8→8, d10→10).
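parse_number comes from the readr package. For column names as simple as these, the same numbers can be obtained in base R by deleting every non-digit character, as in this sketch. It is an alternative under the assumption that each name contains exactly one run of digits, not a general replacement for parse_number.

```r
col_names <- c("d8", "d10", "d12", "d14")

# Drop everything that is not a digit, then convert to numeric
p <- as.numeric(gsub("[^0-9]", "", col_names))
p
```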

# convert list of data frames to matrices
ff <- lapply(ff, as.matrix)

Augment the data with a new column variable ‘year’.

# approach 1: we know the dimensions of the initial data
dtaL <- cbind(Reduce(rbind, ff), 
              year=rep(c(8,10,12,14), c(11,11,11,11))) %>% as.data.frame()

# approach 2: we do not know the dimensions of the initial data
dtaL2 <- cbind(Reduce(rbind, ff), year=rep(p, n)) %>% as.data.frame() 
# rename the second column
names(dtaL)[2] <- "pp_distance"
names(dtaL2)[2] <- "pp_distance"

Use Reduce to ‘loop’ through the list of files (ff) with the rbind function.

Approach 1: construct the year variable by hand, repeating each of 8, 10, 12, and 14 eleven times.

Approach 2: use p and n in place of the numbers typed in by hand in approach 1, where p = c(8,10,12,14) and n = c(11,11,11,11).
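Both approaches rely on rep with a vector of counts as its second argument: the i-th value is repeated the i-th number of times. A short sketch with smaller counts than the 11 used above:

```r
p <- c(8, 10, 12, 14)

# Two copies of each value, in order
equal_counts <- rep(p, c(2, 2, 2, 2))

# Counts may differ per value, which is why approach 2 also
# works when the files do not all have the same number of rows
unequal_counts <- rep(c(8, 10), c(3, 1))

equal_counts
unequal_counts
```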

head(dtaL, 13)

   id pp_distance year
1   1        21.0    8
2   2        21.0    8
3   3        20.5    8
4   4        23.5    8
5   5        21.5    8
6   6        20.0    8
7   7        21.5    8
8   8        23.0    8
9   9        20.0    8
10 10        16.5    8
11 11        24.5    8
12  1        20.0   10
13  2        21.5   10

Exercises

Repeat the above using the male data only.
Repeat the above using the data from both males and females.