Merging data is a fundamental operation in data analysis that allows you to combine datasets from different sources or tables into a single dataset based on common variables (or “keys”). This operation is commonly needed when working with relational data, where different pieces of information are stored in separate datasets.
Key Concepts of Data Merging
Keys (Joining Variables): The columns on which the merge is based are called the “key” columns. These can be one or more variables that appear in both data frames. The by argument in merge() specifies these key columns.
Types of Joins: Depending on how you want to handle unmatched rows, there are several types of joins:
Inner Join: Keeps only rows with matching keys in both datasets.
Left Join (Left Outer Join): Keeps all rows from the left dataset and only matching rows from the right dataset.
Right Join (Right Outer Join): Keeps all rows from the right dataset and only matching rows from the left dataset.
Full Join (Full Outer Join): Keeps all rows from both datasets, filling with NA where there is no match.
library(haven)
merge()
In R, the merge()
function is the most commonly used
method to combine datasets. It allows you to merge two data frames by
matching values of one or more columns (keys) between the two
datasets.
The merge()
function provides a flexible way to perform
these operations and can handle various scenarios, including mismatched
data or missing values.
data1=data.frame('name'=c('AGUS','BUDI','CHANTIKA','DEDE'),
'age'=c(21,23,21,20))
print(data1)
## name age
## 1 AGUS 21
## 2 BUDI 23
## 3 CHANTIKA 21
## 4 DEDE 20
data2=data.frame('name'=c('AGUS','BOEDI','CANTIKA','DEDE','EMAN'),
'marks'=c(21,23,21,20,33))
print(data2)
## name marks
## 1 AGUS 21
## 2 BOEDI 23
## 3 CANTIKA 21
## 4 DEDE 20
## 5 EMAN 33
EXAMPLE 1:
print(merge(data1, data2, by='name', all.x=FALSE))
## name age marks
## 1 AGUS 21 21
## 2 DEDE 20 20
The above table illustrates the merging process of the two data frames. As you can see based on the table, the two data frames have the names AGUS dan DEDE in common. The other names were therefore not kept in the joined data frame. In statistical research literature, this type of merging is often called inner join.
print(merge(data1, data2, by='name'))
## name age marks
## 1 AGUS 21 21
## 2 DEDE 20 20
If all.x = TRUE
is not specified, it defaults to
all.x = FALSE
EXAMPLE 2:
In an intersection, because all = FALSE
df1 <- data.frame(ID = c(1, 2, 3, 4),
Name = c("JUNAEDI", "KIKI", "BAGAS", "ENDANG"),
Age = c(60, 30, 25, 40))
df2 <- data.frame(ID = c(2, 3, 4, 5),
Name = c("KIKI", "BAGAS", "ENDANG",'JOSHUA'),
Age = c(30, 25, 40, 35),
Gender = c("Male", "Female", "Male", "Female"))
merged_df <- merge(df1, df2, by = "ID", all = FALSE)
print(merged_df)
## ID Name.x Age.x Name.y Age.y Gender
## 1 2 KIKI 30 KIKI 30 Male
## 2 3 BAGAS 25 BAGAS 25 Female
## 3 4 ENDANG 40 ENDANG 40 Male
EXAMPLE 3:
merged_df <- merge(df1, df2, by = c("ID", "Name"), all = FALSE)
print(merged_df)
## ID Name Age.x Age.y Gender
## 1 2 KIKI 30 30 Male
## 2 3 BAGAS 25 25 Female
## 3 4 ENDANG 40 40 Male
The merge function provides the options all.x and all.y. These two options can be used to retain certain rows of your input data tables, even when no match is found for the merging. This type of merging is often called left join.
Keep all rows of x-data
The data on the left is used as the base because
all.x = TRUE
print(merge(data1, data2, by='name', all.x=TRUE))
## name age marks
## 1 AGUS 21 21
## 2 BUDI 23 NA
## 3 CHANTIKA 21 NA
## 4 DEDE 20 20
merged_df <- merge(df1, df2, by = "ID", all.x = TRUE)
print(merged_df)
## ID Name.x Age.x Name.y Age.y Gender
## 1 1 JUNAEDI 60 <NA> NA <NA>
## 2 2 KIKI 30 KIKI 30 Male
## 3 3 BAGAS 25 BAGAS 25 Female
## 4 4 ENDANG 40 ENDANG 40 Male
Keep all rows of y-data
print(merge(data1, data2, by='name', all.y=TRUE))
## name age marks
## 1 AGUS 21 21
## 2 BOEDI NA 23
## 3 CANTIKA NA 21
## 4 DEDE 20 20
## 5 EMAN NA 33
merged_df <- merge(df1, df2, by = "ID", all.y=TRUE)
print(merged_df)
## ID Name.x Age.x Name.y Age.y Gender
## 1 2 KIKI 30 KIKI 30 Male
## 2 3 BAGAS 25 BAGAS 25 Female
## 3 4 ENDANG 40 ENDANG 40 Male
## 4 5 <NA> NA JOSHUA 35 Female
Keep all rows of both data frames
print(merge(data1, data2, by='name', all.x = TRUE, all.y = TRUE))
## name age marks
## 1 AGUS 21 21
## 2 BOEDI NA 23
## 3 BUDI 23 NA
## 4 CANTIKA NA 21
## 5 CHANTIKA 21 NA
## 6 DEDE 20 20
## 7 EMAN NA 33
merged_df <- merge(df1, df2, by = c("ID", "Name"), all = TRUE)
print(merged_df)
## ID Name Age.x Age.y Gender
## 1 1 JUNAEDI 60 NA <NA>
## 2 2 KIKI 30 30 Male
## 3 3 BAGAS 25 25 Female
## 4 4 ENDANG 40 40 Male
## 5 5 JOSHUA NA 35 Female
dplyr
The dplyr
package in R offers a set of functions that
simplify the process of data manipulation, including merging or joining
datasets. The dplyr
approach to merging is more intuitive
and readable compared to the base R merge()
function, and
it allows for seamless handling of data. In dplyr
, merging
operations are carried out using the join functions, which are part of
the dplyr
syntax for combining data frames.
Use when you want to keep only rows with matching keys in both datasets.
library(dplyr)
merged_df_simple <- inner_join(df1, df2, by = "ID")
print(merged_df_simple)
## ID Name.x Age.x Name.y Age.y Gender
## 1 2 KIKI 30 KIKI 30 Male
## 2 3 BAGAS 25 BAGAS 25 Female
## 3 4 ENDANG 40 ENDANG 40 Male
library(dplyr)
merged_df_multiple <- inner_join(df1, df2, by = c("ID", "Name", "Age"))
print(merged_df_multiple)
## ID Name Age Gender
## 1 2 KIKI 30 Male
## 2 3 BAGAS 25 Female
## 3 4 ENDANG 40 Male
Use when you want to retain all rows from the left dataset, including rows that don’t have a match in the right dataset.
library("dplyr")
data1=data.frame('name'=c('SAPTA','RASYA','GIRI','GITA'),
'age'=c(21,23,21,20))
print(data1)
## name age
## 1 SAPTA 21
## 2 RASYA 23
## 3 GIRI 21
## 4 GITA 20
data2=data.frame('name'=c('SAPTA','RAHMAN','GIRI','GITA','PUTRA'),
'marks'=c(21,23,21,20,30))
print(data2)
## name marks
## 1 SAPTA 21
## 2 RAHMAN 23
## 3 GIRI 21
## 4 GITA 20
## 5 PUTRA 30
print(left_join(data1, data2, by='name'))
## name age marks
## 1 SAPTA 21 21
## 2 RASYA 23 NA
## 3 GIRI 21 21
## 4 GITA 20 20
Use when you want to retain all rows from the right dataset, including rows that don’t have a match in the left dataset.
print(right_join(data1, data2, by='name'))
## name age marks
## 1 SAPTA 21 21
## 2 GIRI 21 21
## 3 GITA 20 20
## 4 RAHMAN NA 23
## 5 PUTRA NA 30
Use when you want to keep all rows from both datasets, with NA filling in where a match is missing.
print(full_join(data1, data2, by='name'))
## name age marks
## 1 SAPTA 21 21
## 2 RASYA 23 NA
## 3 GIRI 21 21
## 4 GITA 20 20
## 5 RAHMAN NA 23
## 6 PUTRA NA 30
Use when you want to filter the left dataset to only include rows that have matching keys in the right dataset, without adding any columns from the right dataset.
print(semi_join(data1, data2, by='name'))
## name age
## 1 SAPTA 21
## 2 GIRI 21
## 3 GITA 20
Use when you want to filter the left dataset to exclude rows that have a match in the right dataset.
print(anti_join(data1, data2, by='name'))
## name age
## 1 RASYA 23
Merge using merge()
Merge using package dplyr
I want to perform a merging process between the Susenas individual data and the Ruta data
View(Indv)
View(Ruta)
#Mengurutkan berdasarkan kolom 'URUT'
Indv <- Indv %>% arrange(URUT)
Ruta <- Ruta %>% arrange(URUT)
#Menggabungkan kedua file berdasarkan kolom 'URUT'
GAB_SSN <- left_join(Indv, Ruta, by = "URUT")
https://www.geeksforgeeks.org/how-to-do-a-left-join-in-r/
Kabacoff, R. I. (2015). R in Action: Data Analysis and Graphics with R. Manning Publications.
Wickham, H., & Grolemund, G. (2017). R for Data Science. O’Reilly Media. Online Book
Welfare Statistics Directorate, BPS, saptahas@bps.go.id