Merging data is a fundamental operation in data analysis that allows you to combine datasets from different sources or tables into a single dataset based on common variables (or “keys”). This operation is commonly needed when working with relational data, where different pieces of information are stored in separate datasets.

Key Concepts of Data Merging

Keys (Joining Variables): The columns on which the merge is based are called the “key” columns. These can be one or more variables that appear in both data frames. The by argument in merge() specifies these key columns.

Types of Joins: Depending on how you want to handle unmatched rows, there are several types of joins:

Inner Join: Keeps only rows with matching keys in both datasets.

Left Join (Left Outer Join): Keeps all rows from the left dataset and only matching rows from the right dataset.

Right Join (Right Outer Join): Keeps all rows from the right dataset and only matching rows from the left dataset.

Full Join (Full Outer Join): Keeps all rows from both datasets, filling with NA where there is no match.

library(haven)

1 Method 1: `merge()`

In R, the merge() function is the most commonly used method to combine datasets. It allows you to merge two data frames by matching values of one or more columns (keys) between the two datasets.

The merge() function provides a flexible way to perform these operations and can handle various scenarios, including mismatched data or missing values.

1.1 Inner Join

data1=data.frame('name'=c('AGUS','BUDI','CHANTIKA','DEDE'),
                 'age'=c(21,23,21,20))
print(data1)

##       name age
## 1     AGUS  21
## 2     BUDI  23
## 3 CHANTIKA  21
## 4     DEDE  20

data2=data.frame('name'=c('AGUS','BOEDI','CANTIKA','DEDE','EMAN'),
                 'marks'=c(21,23,21,20,33))
print(data2)

##      name marks
## 1    AGUS    21
## 2   BOEDI    23
## 3 CANTIKA    21
## 4    DEDE    20
## 5    EMAN    33

EXAMPLE 1:

print(merge(data1, data2, by='name', all.x=FALSE))

##   name age marks
## 1 AGUS  21    21
## 2 DEDE  20    20

The above table illustrates the merging process of the two data frames. As you can see based on the table, the two data frames have the names AGUS dan DEDE in common. The other names were therefore not kept in the joined data frame. In statistical research literature, this type of merging is often called inner join.

print(merge(data1, data2, by='name'))

##   name age marks
## 1 AGUS  21    21
## 2 DEDE  20    20

If all.x = TRUE is not specified, it defaults to all.x = FALSE

EXAMPLE 2:

In an intersection, because all = FALSE

df1 <- data.frame(ID = c(1, 2, 3, 4),
                  Name = c("JUNAEDI", "KIKI", "BAGAS", "ENDANG"),
                  Age = c(60, 30, 25, 40))

df2 <- data.frame(ID = c(2, 3, 4, 5),
                  Name = c("KIKI", "BAGAS", "ENDANG",'JOSHUA'),
                  Age = c(30, 25, 40, 35),
                  Gender = c("Male", "Female", "Male", "Female"))

merged_df <- merge(df1, df2, by = "ID", all = FALSE)
print(merged_df)

##   ID Name.x Age.x Name.y Age.y Gender
## 1  2   KIKI    30   KIKI    30   Male
## 2  3  BAGAS    25  BAGAS    25 Female
## 3  4 ENDANG    40 ENDANG    40   Male

EXAMPLE 3:

merged_df <- merge(df1, df2, by = c("ID", "Name"), all = FALSE)
print(merged_df)

##   ID   Name Age.x Age.y Gender
## 1  2   KIKI    30    30   Male
## 2  3  BAGAS    25    25 Female
## 3  4 ENDANG    40    40   Male

1.2 Left Join

The merge function provides the options all.x and all.y. These two options can be used to retain certain rows of your input data tables, even when no match is found for the merging. This type of merging is often called left join.

Keep all rows of x-data

The data on the left is used as the base because all.x = TRUE

print(merge(data1, data2, by='name', all.x=TRUE))

##       name age marks
## 1     AGUS  21    21
## 2     BUDI  23    NA
## 3 CHANTIKA  21    NA
## 4     DEDE  20    20

merged_df <- merge(df1, df2, by = "ID", all.x = TRUE)
print(merged_df)

##   ID  Name.x Age.x Name.y Age.y Gender
## 1  1 JUNAEDI    60   <NA>    NA   <NA>
## 2  2    KIKI    30   KIKI    30   Male
## 3  3   BAGAS    25  BAGAS    25 Female
## 4  4  ENDANG    40 ENDANG    40   Male

1.3 Right Join

Keep all rows of y-data

print(merge(data1, data2, by='name', all.y=TRUE))

##      name age marks
## 1    AGUS  21    21
## 2   BOEDI  NA    23
## 3 CANTIKA  NA    21
## 4    DEDE  20    20
## 5    EMAN  NA    33

merged_df <- merge(df1, df2, by = "ID", all.y=TRUE)
print(merged_df)

##   ID Name.x Age.x Name.y Age.y Gender
## 1  2   KIKI    30   KIKI    30   Male
## 2  3  BAGAS    25  BAGAS    25 Female
## 3  4 ENDANG    40 ENDANG    40   Male
## 4  5   <NA>    NA JOSHUA    35 Female

1.4 Full Join

Keep all rows of both data frames

print(merge(data1, data2, by='name', all.x = TRUE, all.y = TRUE))

##       name age marks
## 1     AGUS  21    21
## 2    BOEDI  NA    23
## 3     BUDI  23    NA
## 4  CANTIKA  NA    21
## 5 CHANTIKA  21    NA
## 6     DEDE  20    20
## 7     EMAN  NA    33

merged_df <- merge(df1, df2, by = c("ID", "Name"), all = TRUE)
print(merged_df)

##   ID    Name Age.x Age.y Gender
## 1  1 JUNAEDI    60    NA   <NA>
## 2  2    KIKI    30    30   Male
## 3  3   BAGAS    25    25 Female
## 4  4  ENDANG    40    40   Male
## 5  5  JOSHUA    NA    35 Female

2 Method 2: package `dplyr`

The dplyr package in R offers a set of functions that simplify the process of data manipulation, including merging or joining datasets. The dplyr approach to merging is more intuitive and readable compared to the base R merge() function, and it allows for seamless handling of data. In dplyr, merging operations are carried out using the join functions, which are part of the dplyr syntax for combining data frames.

2.1 Inner Join

Use when you want to keep only rows with matching keys in both datasets.

library(dplyr)
merged_df_simple <- inner_join(df1, df2, by = "ID")
print(merged_df_simple)

##   ID Name.x Age.x Name.y Age.y Gender
## 1  2   KIKI    30   KIKI    30   Male
## 2  3  BAGAS    25  BAGAS    25 Female
## 3  4 ENDANG    40 ENDANG    40   Male

library(dplyr)
merged_df_multiple <- inner_join(df1, df2, by = c("ID", "Name", "Age"))
print(merged_df_multiple)

##   ID   Name Age Gender
## 1  2   KIKI  30   Male
## 2  3  BAGAS  25 Female
## 3  4 ENDANG  40   Male

2.2 Left Join

Use when you want to retain all rows from the left dataset, including rows that don’t have a match in the right dataset.

library("dplyr")

data1=data.frame('name'=c('SAPTA','RASYA','GIRI','GITA'), 
                 'age'=c(21,23,21,20))
print(data1)

##    name age
## 1 SAPTA  21
## 2 RASYA  23
## 3  GIRI  21
## 4  GITA  20

data2=data.frame('name'=c('SAPTA','RAHMAN','GIRI','GITA','PUTRA'), 
                 'marks'=c(21,23,21,20,30))
print(data2)

##     name marks
## 1  SAPTA    21
## 2 RAHMAN    23
## 3   GIRI    21
## 4   GITA    20
## 5  PUTRA    30

print(left_join(data1, data2, by='name'))

##    name age marks
## 1 SAPTA  21    21
## 2 RASYA  23    NA
## 3  GIRI  21    21
## 4  GITA  20    20

2.3 Right Join

Use when you want to retain all rows from the right dataset, including rows that don’t have a match in the left dataset.

print(right_join(data1, data2, by='name'))

##     name age marks
## 1  SAPTA  21    21
## 2   GIRI  21    21
## 3   GITA  20    20
## 4 RAHMAN  NA    23
## 5  PUTRA  NA    30

2.4 Full Join

Use when you want to keep all rows from both datasets, with NA filling in where a match is missing.

print(full_join(data1, data2, by='name'))

##     name age marks
## 1  SAPTA  21    21
## 2  RASYA  23    NA
## 3   GIRI  21    21
## 4   GITA  20    20
## 5 RAHMAN  NA    23
## 6  PUTRA  NA    30

2.5 Semi Join

Use when you want to filter the left dataset to only include rows that have matching keys in the right dataset, without adding any columns from the right dataset.

print(semi_join(data1, data2, by='name'))

##    name age
## 1 SAPTA  21
## 2  GIRI  21
## 3  GITA  20

2.6 Anti Join

Use when you want to filter the left dataset to exclude rows that have a match in the right dataset.

print(anti_join(data1, data2, by='name'))

##    name age
## 1 RASYA  23

3 Summary

Merge using merge()

Merge using package dplyr

4 Merging Data Susenas

I want to perform a merging process between the Susenas individual data and the Ruta data

View(Indv)
View(Ruta)
#Mengurutkan berdasarkan kolom 'URUT'
Indv <- Indv %>% arrange(URUT)
Ruta <- Ruta %>% arrange(URUT)

#Menggabungkan kedua file berdasarkan kolom 'URUT'
GAB_SSN <- left_join(Indv, Ruta, by = "URUT")

5 References

https://www.geeksforgeeks.org/how-to-do-a-left-join-in-r/

Kabacoff, R. I. (2015). R in Action: Data Analysis and Graphics with R. Manning Publications.

Wickham, H., & Grolemund, G. (2017). R for Data Science. O’Reilly Media. Online Book

Welfare Statistics Directorate, BPS, saptahas@bps.go.id

Merging Data

Sapta Hastho Ponco

2024-04-02

1 Method 1: `merge()`

1.1 Inner Join

1.2 Left Join

1.3 Right Join

1.4 Full Join

2 Method 2: package `dplyr`

2.1 Inner Join

2.2 Left Join

2.3 Right Join

2.4 Full Join

2.5 Semi Join

2.6 Anti Join

3 Summary

4 Merging Data Susenas

5 References

Merging Data

Sapta Hastho Ponco

2024-04-02

1 Method 1: merge()

1.1 Inner Join

1.2 Left Join

1.3 Right Join

1.4 Full Join

2 Method 2: package dplyr

2.1 Inner Join

2.2 Left Join

2.3 Right Join

2.4 Full Join

2.5 Semi Join

2.6 Anti Join

3 Summary

4 Merging Data Susenas

5 References

1 Method 1: `merge()`

2 Method 2: package `dplyr`