Data wrangling: Homework 4

2020-Spring [Data Management] Instructor: SHEU, Ching-Fan

CHIU, Ming-Tzu

2020-04-12

Reverse the order of input to the series of dplyr::*_join examples using data from the Nobel laureates in literature and explain the resulting output.

讀取資料

dta1 <- read.table("C:/Users/TheorEco Lab/Desktop/108-2/DataManagement/0330/nobel_countries.txt", header=T)
dta2 <- read.table("C:/Users/TheorEco Lab/Desktop/108-2/DataManagement/0330/nobel_winners.txt", header=T)
str(dta1)
#> 'data.frame':    8 obs. of  2 variables:
#>  $ Country: Factor w/ 7 levels "Canada","China",..: 3 6 6 7 1 2 4 5
#>  $ Year   : int  2014 1950 2017 2016 2013 2012 2015 2011
str(dta2)
#> 'data.frame':    7 obs. of  3 variables:
#>  $ Name  : Factor w/ 7 levels "Alice  Munro",..: 6 2 4 3 1 5 7
#>  $ Gender: Factor w/ 2 levels "Female","Male": 2 2 2 2 1 2 1
#>  $ Year  : int  2014 1950 2017 2016 2013 2012 1938

資料合併

dplyr::inner_join(dta1, dta2)
#> Joining, by = "Year"
#>   Country Year              Name Gender
#> 1  France 2014   Patrick Modiano   Male
#> 2      UK 1950 Bertrand  Russell   Male
#> 3      UK 2017    Kazuo Ishiguro   Male
#> 4      US 2016        Bob  Dylan   Male
#> 5  Canada 2013      Alice  Munro Female
#> 6   China 2012            Mo Yan   Male

兩筆資料根據其年份合併並顯示年份有所交集的 6 筆觀察值,兩個資料的所有欄位皆有顯示。

dplyr::semi_join(dta1, dta2)
#> Joining, by = "Year"
#>   Country Year
#> 1  France 2014
#> 2      UK 1950
#> 3      UK 2017
#> 4      US 2016
#> 5  Canada 2013
#> 6   China 2012

根據年份合併,一樣只顯示有交集的年份資料,但只顯示 dta1 有的欄位。

dplyr::left_join(dta1, dta2)
#> Joining, by = "Year"
#>   Country Year              Name Gender
#> 1  France 2014   Patrick Modiano   Male
#> 2      UK 1950 Bertrand  Russell   Male
#> 3      UK 2017    Kazuo Ishiguro   Male
#> 4      US 2016        Bob  Dylan   Male
#> 5  Canada 2013      Alice  Munro Female
#> 6   China 2012            Mo Yan   Male
#> 7  Russia 2015              <NA>   <NA>
#> 8  Sweden 2011              <NA>   <NA>

所有 dta1 的國家都有呈現,也就是呈現其 row,欄位則是同時呈現 dta1 和 dta2 的聯集。

dplyr::anti_join(dta1, dta2)
#> Joining, by = "Year"
#>   Country Year
#> 1  Russia 2015
#> 2  Sweden 2011

顯示 dta1 的欄位資料,但只顯示只有 dta1 有但 dta2 沒有的觀察值。

dplyr::full_join(dta1, dta2)
#> Joining, by = "Year"
#>   Country Year              Name Gender
#> 1  France 2014   Patrick Modiano   Male
#> 2      UK 1950 Bertrand  Russell   Male
#> 3      UK 2017    Kazuo Ishiguro   Male
#> 4      US 2016        Bob  Dylan   Male
#> 5  Canada 2013      Alice  Munro Female
#> 6   China 2012            Mo Yan   Male
#> 7  Russia 2015              <NA>   <NA>
#> 8  Sweden 2011              <NA>   <NA>
#> 9    <NA> 1938        Pearl Buck Female

顯示完全 dta1 和 dta2 的聯集,所有的欄與行皆呈現,年份重複的則合併,無法合併的則有遺漏值。