produciton.knit

[NMIC] Plot Looping

National Minerals Information Center

目標是使用NMIC資料繪製基本金屬的各國產量柱狀圖，並進行2021年與2020年的比較。特殊的要求在於七種金屬在不同的檔案，因此嘗試以迭代方式批量製作圖表以及合併輸出計算後的表格。

檔案連結：https://www.sciencebase.gov/catalog/item/6197ccbed34eb622f692ee1c

在Attached Files項目下點擊world.zip進行下載。資料夾中提供所有金屬的.xml檔和.csv檔，我們只需選基本金屬的.csv檔，將選取的檔案複製到目標資料夾。

Packages

套件package是R軟體的基礎上，增添新的模組，使我們可以執行更多功能。

所有套件在第一次使用前需要下載至電腦，使用install.packages() ，下載後套件內容就儲存於電腦中，下次使用只需要呼叫library() ，不必再下載。以下是本文中會使用到的package：

# install.packages("readxl")

library(magrittr)
library(readxl)
library(ggplot2) 
library(ggrepel) 
library(reshape2) 
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(stringr) 
library(writexl) 
library(readr)
library(lattice)

File Name, Metal, Unit

安裝和呼叫完套件後，接者要指定工作資料夾，所有資料的呼叫和儲存都會在這個資料夾，注意Windows系統的資料夾位置預設的分隔符號是反斜線\，而在程式中要將反斜線改為斜線/，而MacOS使用的資料夾位置中，桌面和文件以前的位置可以使用波浪號~取代。使用setwd()設定工作目錄，括號內放入工作資料夾位置，getwd()可以查看目前的工作位置。

setwd("~/Desktop/R/CIER/National Mineral Information Center")

getwd()

## [1] "/Users/hasukeashita/Desktop/R/CIER/National Mineral Information Center"

為了後續可以迭代方便，我們需要先建立關於檔案和金屬名的向量。list.files()可以列出當前資料夾中所有的檔案，pattern可以指定檔案名的格式，由於我們的資料名稱格式為mcs2022-金屬_world.csv，我們可以設定只要檔名中有mcs2022的檔案，因此指定參數pattern = "msc2022"。俵且將此向量存為file_name這個變數。

file_name <- list.files(pattern = "mcs2022")

print(file_name)

## [1] "mcs2022-alumi_world.csv" "mcs2022-coppe_world.csv"
## [3] "mcs2022-lead_world.csv"  "mcs2022-mgmet_world.csv"
## [5] "mcs2022-nicke_world.csv" "mcs2022-tin_world.csv"  
## [7] "mcs2022-zinc_world.csv"

接者設定基本金屬名稱的向量，向量可以儲存許多元素，使用c()來建立向量。metal中儲存了英文的金屬名稱，而c_metal中儲存中文的金屬名稱。

metal <- c("Alumi", "Copper", "Lead", "Magne", "Nickel", "Tin", "Zinc")

c_metal <- c("鋁", "銅", "鉛", "鎂", "鎳", "錫", "鋅")

由於不同金屬中使用的單位不同，鎳跟錫使用的單位是噸，而其他的是千噸。因此先建立一個全是千噸的向量，再將屬於鎳和錫的單位改為噸。rep()可以製作重複的向量，x是要重複的元素，times是元素重複的次數。 []可以指定物件中的某些元素，unit[c(5, 6)]表示選取unit物件中第5和6個元素。

unit <- rep(x = "千噸", times = 7)
unit[c(5, 6)] <- "噸"
print(unit)

## [1] "千噸" "千噸" "千噸" "千噸" "噸"   "噸"   "千噸"

e_unit <- rep("thousand tons", 7)
e_unit[c(5, 6)] <- "tons"
print(e_unit)

## [1] "thousand tons" "thousand tons" "thousand tons" "thousand tons"
## [5] "tons"          "tons"          "thousand tons"

Data Cleaning

由於本文需要繪製7張柱狀圖，而每個柱狀圖的格式皆相同，使用的資料也為不同檔案，因此應使用迭代來取代重複的工作，但迭代不易看出資料整理以及繪製圖表的流程，因此先以alumi為例，了解執行的過程後，再來使用迭代。

使用read.csv()這個函式讀取鋁的csv檔"mcs2022-alumi_world.csv"，再將資料存df變數中。head()可以查看物件中前六項的資料。

df <- read.csv("mcs2022-alumi_world.csv")

head(df)

##    Source       Country               Type Prod_kt_2020 Prod_kt_Est_2021
## 1 MCS2022 United States smelter production         1012              880
## 2 MCS2022     Australia smelter production         1580             1600
## 3 MCS2022       Bahrain smelter production         1550             1500
## 4 MCS2022        Canada smelter production         3120             3100
## 5 MCS2022         China smelter production        37100            39000
## 6 MCS2022       Iceland smelter production          860              880

現在我們希望將資料轉換成直表，將多個column合成為一個column，並且計算佔比，最後再篩選佔比大於3的國家。最後資料會呈現為Country, Year, Prod, Percent。

With pip

%>%為pipe可以將運算結果傳遞給下一個函式的第一個參數，可以使程式碼更為簡潔，並且避免儲存過多無用的變數佔據記憶體空間。但下方仍會逐步拆解步驟，以了解每個函數的使用方式及產出。

df %>% 
  melt(id = c("Source", "Country", "Type"), 
       variable.name = "Year", value.name = "Prod") %>% 
  mutate(Year = str_sub(Year, start = -4L)) %>% 
  select(Country, Year, Prod) %>% 
  group_by(Year) %>% 
  mutate(Percent = (Prod / Prod[which(Country == "World total (rounded)")] * 100) %>% 
           round(2)) %>% 
  filter(! Country %in% c("World total (rounded)", "Other countries")
                       & Percent > 3) %>% 
  ungroup %>% 
  head

## # A tibble: 6 × 4
##   Country              Year   Prod Percent
##   <chr>                <chr> <int>   <dbl>
## 1 Canada               2020   3120    4.79
## 2 China                2020  37100   57.0 
## 3 India                2020   3560    5.47
## 4 Russia               2020   3640    5.59
## 5 United Arab Emirates 2020   2520    3.87
## 6 Canada               2021   3100    4.56

without pip

melt(id, variable.name, value.name)將資料進行轉置，將多個column合成為一個column，可以想像為寬表轉直表。在此我們希望將Prod_kt_2020和Prod_kt_Est_2021兩個column合成為一個表示Year的column和顯示產量的column。我們再將資料儲存至df_tidy這個變數。id這個參數指定哪些column不要合併，variable.name則指定合併過後，原先的變數名稱變成值後的變數名。value.name指定數值的名稱。

melt

df_tidy <- melt(data = df, id = c("Source", "Country", "Type"), 
       variable.name = "Year", value.name = "Prod")

head(df_tidy[c("Year", "Prod")])

##           Year  Prod
## 1 Prod_kt_2020  1012
## 2 Prod_kt_2020  1580
## 3 Prod_kt_2020  1550
## 4 Prod_kt_2020  3120
## 5 Prod_kt_2020 37100
## 6 Prod_kt_2020   860

mutate, str_sub

mutate(data, colname = value) mutate的功能是新增一列（column），等號左邊放新column的名稱，等號右邊放入數值。在此我們要新增一個名為Year的column，數值是原先的Year中抽取最後四個字。

print(df_tidy$Year)

##  [1] Prod_kt_2020     Prod_kt_2020     Prod_kt_2020     Prod_kt_2020    
##  [5] Prod_kt_2020     Prod_kt_2020     Prod_kt_2020     Prod_kt_2020    
##  [9] Prod_kt_2020     Prod_kt_2020     Prod_kt_2020     Prod_kt_2020    
## [13] Prod_kt_Est_2021 Prod_kt_Est_2021 Prod_kt_Est_2021 Prod_kt_Est_2021
## [17] Prod_kt_Est_2021 Prod_kt_Est_2021 Prod_kt_Est_2021 Prod_kt_Est_2021
## [21] Prod_kt_Est_2021 Prod_kt_Est_2021 Prod_kt_Est_2021 Prod_kt_Est_2021
## Levels: Prod_kt_2020 Prod_kt_Est_2021

str_sub(string, start=1L, end=-1L)，str_sub抽取字串中的元素，從1L抽取至-1L，1L表示第一個，-1L表示最後一個（倒數第一個）。在本處Year字串最後四個字剛好是我們要的年份，因此我們指定start = -4L，而end不指定就是到最後。

df_tidy <- mutate(Year = str_sub(Year, start = -4L),
                  .data = df_tidy)

print(df_tidy$Year)

##  [1] "2020" "2020" "2020" "2020" "2020" "2020" "2020" "2020" "2020" "2020"
## [11] "2020" "2020" "2021" "2021" "2021" "2021" "2021" "2021" "2021" "2021"
## [21] "2021" "2021" "2021" "2021"

select

接著挑選Country、Year、Prod欄位，使用select(.data, ...)，直接填入需要的變數名。

df_tidy <- select(.data = df_tidy, Country, Year, Prod)

head(df_tidy)

##         Country Year  Prod
## 1 United States 2020  1012
## 2     Australia 2020  1580
## 3       Bahrain 2020  1550
## 4        Canada 2020  3120
## 5         China 2020 37100
## 6       Iceland 2020   860

group_by, mutate

接著我們想要計算每個國家的佔比，在Country中除了每個國家外，還有World和Other，佔比的計算方法是將國家的產量除以世界產量。首先將資料依據年份分組，使用group_by(.data, ...)，函數中填入作為分組依據的變數，Year中有2020與2021兩種數值，因此整個資料會被分作兩組。每一組內都有一個World的資料。

filter(.data = df_tidy, Country == "World total (rounded)")

##                 Country Year  Prod
## 1 World total (rounded) 2020 65100
## 2 World total (rounded) 2021 68000

df_tidy$Prod[which(df_tidy$Country == "World total (rounded)")]

## [1] 65100 68000

df_tidy <- mutate(Percent = (Prod/Prod[which(Country == "World total (rounded)")] * 100) %>% 
         round(2),
       .data = df_tidy)

head(df_tidy)

##         Country Year  Prod Percent
## 1 United States 2020  1012    1.55
## 2     Australia 2020  1580    2.32
## 3       Bahrain 2020  1550    2.38
## 4        Canada 2020  3120    4.59
## 5         China 2020 37100   56.99
## 6       Iceland 2020   860    1.26

filter

百分比的資料新增好後，我們要篩除World和Other的資料，並且只保留佔比達3%以上國家。filter函數內放入篩選的條件。

df_tidy <- filter(.data = df_tidy,
       (! Country %in% c("World total (rounded)", "Other countries")) &
         Percent >= 3)

df_tidy

##                 Country Year  Prod Percent
## 1                Canada 2020  3120    4.59
## 2                 China 2020 37100   56.99
## 3                 India 2020  3560    5.47
## 4                Russia 2020  3640    5.59
## 5  United Arab Emirates 2020  2520    3.71
## 6                Canada 2021  3100    4.56
## 7                 China 2021 39000   59.91
## 8                 India 2021  3900    5.99
## 9                Russia 2021  3700    5.68
## 10 United Arab Emirates 2021  2600    3.82

Bar Chart

資料清洗完後，可以開始繪製長條圖，我們想要Y軸為產量，x軸為國家，並且用年份分組。barchart是lattice繪圖系統中繪製長條圖的函數，另一個更常見的繪圖系統是ggplot2。長條圖中需指定Y ~ x的變數，分組group的變數，其他為一些繪圖參數設定，非必要設定，可以參考文件，或使用??barchart查看說明。

barchart(Prod ~ Country, group = Year, data = df_tidy,
         horizental = TRUE, aspec = 0.5,
         auto.key = list(space = "inside"),
         main = paste0("Alumi", " Production Quantity Bar Chart (2020-2021)"),
         xlab = "Country", ylab = paste0("amount(", "kt", ")"),
         sub = "Source: National Mineral Information Center",
         scales=list(x=list(cex=1.2), y = list(cex=1.2)),
         cex.main = 3) %>% 
  print

Looping

在以Alumi作為範例展示完整理後的表格以及長條圖後，我們希望能得到將七個金屬的整理後表格組成的list，以及七個圖表的png檔。

其實以apply、map等函數也可以達到此目的，但過程需將所有動作包裝成function，方法較為抽象。我們此處先使用for迴圈進行迭代。

在一開始我們已經準備了檔案名的list，以及金屬名、單位的向量。

print(file_name)

## [1] "mcs2022-alumi_world.csv" "mcs2022-coppe_world.csv"
## [3] "mcs2022-lead_world.csv"  "mcs2022-mgmet_world.csv"
## [5] "mcs2022-nicke_world.csv" "mcs2022-tin_world.csv"  
## [7] "mcs2022-zinc_world.csv"

print(metal)

## [1] "Alumi"  "Copper" "Lead"   "Magne"  "Nickel" "Tin"    "Zinc"

print(e_unit)

## [1] "thousand tons" "thousand tons" "thousand tons" "thousand tons"
## [5] "tons"          "tons"          "thousand tons"

print(c(file_name[1], metal[1], e_unit[1]))

## [1] "mcs2022-alumi_world.csv" "Alumi"                  
## [3] "thousand tons"

for

我們先創立一個名為output的list用以儲存整理後的表格。迴圈的重點在於將個別金屬的資訊使用i來指定。如file_name[i]、metal[i]、e_unit[i]。

若直接執行的話會出現錯誤，使用回圈的好處是我們可以看到迴圈運行到lead，也就是i = 3時發生錯誤，原因是在lead的csv檔案中，數字1900被表示成"1,900"，可以使用記事本或文字編輯工具將"1,900"改為1900。以此類推還有數值需改正。

因為此狀況數量不多，手動修改較為方便。若數量龐大的情況可以使用str_replace(pattern = ",", replace = "")以及as.numeric。

output <- list()

for (i in 1:7){
  
  df <- read.csv(file_name[i])
  
  df <- df %>% 
  melt(id = c("Source", "Country", "Type"), 
       variable.name = "Year", value.name = "Prod") %>% 
  mutate(Year = str_sub(Year, start = -4L)) %>% 
  select(Country, Year, Prod) %>% 
  group_by(Year) %>% 
  mutate(
    Percent = (Prod / Prod[which(Country == "World total (rounded)")] * 100) %>% 
           round(2)) %>% 
  filter(! Country %in% c("World total (rounded)", "Other countries")
                       & Percent > 3) %>% 
  ungroup
  
  output[[metal[i]]] <- df
  
  png(str_c(metal[i], ".png"), width = 800, height = 480)
  
  barchart(Prod ~ Country, group = Year, data = df_tidy,
         horizental = TRUE, aspec = 0.5,
         auto.key = list(space = "inside"),
         main = paste0(metal[i], " Production Quantity Bar Chart (2020-2021)"),
         xlab = "Country", ylab = paste0("amount(", e_unit[i], ")"),
         sub = "Source: National Mineral Information Center",
         scales=list(x=list(cex=1.2), y = list(cex=1.2)),
         cex.main = 3) %>% 
    print() 
  
  dev.off()
  
}

輸出圖表的方式是將要輸出的圖片放在png和dev.off之間，png函數內指定輸出的檔案名、圖片寬度、圖片長度。

Export data

output為一個list，裡面存放七個整理過後的表格，並且以金屬作為元素（表格）名。 write_xlsx將output輸出為一個叫做Production_Data_Tidied.xlsx的xlsx檔。

print(output)

## $Alumi
## # A tibble: 10 × 4
##    Country              Year   Prod Percent
##    <chr>                <chr> <int>   <dbl>
##  1 Canada               2020   3120    4.79
##  2 China                2020  37100   57.0 
##  3 India                2020   3560    5.47
##  4 Russia               2020   3640    5.59
##  5 United Arab Emirates 2020   2520    3.87
##  6 Canada               2021   3100    4.56
##  7 China                2021  39000   57.4 
##  8 India                2021   3900    5.74
##  9 Russia               2021   3700    5.44
## 10 United Arab Emirates 2021   2600    3.82
## 
## $Copper
## # A tibble: 19 × 4
##    Country         Year   Prod Percent
##    <chr>           <chr> <int>   <dbl>
##  1 "United States" 2020   1200    5.83
##  2 "Australia"     2020    885    4.3 
##  3 "Chile"         2020   5730   27.8 
##  4 "China"         2020   1720    8.35
##  5 "Congo "        2020   1600    7.77
##  6 "Mexico"        2020    733    3.56
##  7 "Peru"          2020   2150   10.4 
##  8 "Russia"        2020    810    3.93
##  9 "Zambia"        2020    853    4.14
## 10 "United States" 2021   1200    5.71
## 11 "Australia"     2021    900    4.29
## 12 "Chile"         2021   5600   26.7 
## 13 "China"         2021   1800    8.57
## 14 "Congo "        2021   1800    8.57
## 15 "Indonesia"     2021    810    3.86
## 16 "Mexico"        2021    720    3.43
## 17 "Peru"          2021   2200   10.5 
## 18 "Russia"        2021    820    3.9 
## 19 "Zambia"        2021    830    3.95
## 
## $Lead
## # A tibble: 14 × 4
##    Country       Year   Prod Percent
##    <chr>         <chr> <int>   <dbl>
##  1 United States 2020    306    6.99
##  2 Australia     2020    494   11.3 
##  3 China         2020   1900   43.4 
##  4 India         2020    204    4.66
##  5 Mexico        2020    260    5.94
##  6 Peru          2020    242    5.53
##  7 Russia        2020    210    4.79
##  8 United States 2021    300    6.98
##  9 Australia     2021    500   11.6 
## 10 China         2021   2000   46.5 
## 11 India         2021    210    4.88
## 12 Mexico        2021    270    6.28
## 13 Peru          2021    280    6.51
## 14 Russia        2021    210    4.88
## 
## $Magne
## # A tibble: 4 × 4
##   Country Year   Prod Percent
##   <chr>   <chr> <int>   <dbl>
## 1 China   2020    886   88.6 
## 2 Russia  2020     48    4.8 
## 3 China   2021    800   84.2 
## 4 Russia  2021     60    6.32
## 
## $Nickel
## # A tibble: 16 × 4
##    Country          Year     Prod Percent
##    <chr>            <chr>   <int>   <dbl>
##  1 "Australia"      2020   169000    6.73
##  2 "Brazil"         2020    77100    3.07
##  3 "Canada"         2020   167000    6.65
##  4 "China"          2020   120000    4.78
##  5 "Indonesia"      2020   771000   30.7 
##  6 "New Caledonia " 2020   200000    7.97
##  7 "Philippines"    2020   334000   13.3 
##  8 "Russia"         2020   283000   11.3 
##  9 "Australia"      2021   160000    5.93
## 10 "Brazil"         2021   100000    3.7 
## 11 "Canada"         2021   130000    4.81
## 12 "China"          2021   120000    4.44
## 13 "Indonesia"      2021  1000000   37.0 
## 14 "New Caledonia " 2021   190000    7.04
## 15 "Philippines"    2021   370000   13.7 
## 16 "Russia"         2021   250000    9.26
## 
## $Tin
## # A tibble: 15 × 4
##    Country   Year   Prod Percent
##    <chr>     <chr> <int>   <dbl>
##  1 Australia 2020   8120    3.08
##  2 Bolivia   2020  14700    5.57
##  3 Brazil    2020  16900    6.4 
##  4 Burma     2020  29000   11.0 
##  5 China     2020  84000   31.8 
##  6 Congo     2020  17300    6.55
##  7 Indonesia 2020  53000   20.1 
##  8 Peru      2020  20600    7.8 
##  9 Bolivia   2021  18000    6   
## 10 Brazil    2021  22000    7.33
## 11 Burma     2021  28000    9.33
## 12 China     2021  91000   30.3 
## 13 Congo     2021  16000    5.33
## 14 Indonesia 2021  71000   23.7 
## 15 Peru      2021  30000   10   
## 
## $Zinc
## # A tibble: 13 × 4
##    Country       Year   Prod Percent
##    <chr>         <chr> <int>   <dbl>
##  1 United States 2020    718    5.98
##  2 Australia     2020   1310   10.9 
##  3 China         2020   4060   33.8 
##  4 India         2020    720    6   
##  5 Mexico        2020    638    5.32
##  6 Peru          2020   1330   11.1 
##  7 United States 2021    740    5.69
##  8 Australia     2021   1300   10   
##  9 Bolivia       2021    490    3.77
## 10 China         2021   4200   32.3 
## 11 India         2021    810    6.23
## 12 Mexico        2021    720    5.54
## 13 Peru          2021   1600   12.3

write_xlsx(output, "Production_Data_Tidied.xlsx")