This challenge will focus on joining multiple datasets and analyzes aspects of the newly formed joint data.
We are using two datasets. Both are FAO stats with one dataset related to livestock and the other about chicken and eggs.
First, the libraries are loaded.
library(readr)
library(here)
## here() starts at C:/Users/SHAURYA/Desktop/Studies/Winter 2024 601/Challenges/challenge 8
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
We then load the datasets.
livestock <- read_csv("FAOSTAT_livestock.csv", show_col_types = FALSE)
egg <- read_csv("FAOSTAT_egg_chicken.csv", show_col_types = FALSE)
We can have a look at the data.
head(livestock)
## # A tibble: 6 × 14
## `Domain Code` Domain `Area Code` Area `Element Code` Element `Item Code`
## <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
## 1 QA Live Anima… 2 Afgh… 5111 Stocks 1107
## 2 QA Live Anima… 2 Afgh… 5111 Stocks 1107
## 3 QA Live Anima… 2 Afgh… 5111 Stocks 1107
## 4 QA Live Anima… 2 Afgh… 5111 Stocks 1107
## 5 QA Live Anima… 2 Afgh… 5111 Stocks 1107
## 6 QA Live Anima… 2 Afgh… 5111 Stocks 1107
## # ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
## # Value <dbl>, Flag <chr>, `Flag Description` <chr>
We see a variety of columns such as Domain, Area, Year, Item, and Value. It has a mix of both numerical and categorical data which is useful for further analysis.
An explanation of data in each column can be given-
Domain Code - Represents the national website domain of a country. In this case, it is Qatar.
Domain - Describes the relvant domain of the livestock. Here, all the animals are in the Live Animals domain.
Area Code - Represents the code assigned to each country.
Area - Represents the country.
Element Code - Numerical value assigned to the element.
Element - Shows the element for each animal. Here, it is common for all being stocks.
Item Code - Code assigned to every item.
Item - Represents the category of the livestock such as Asses, Buffaloes, Pigs, etc.
Year Code and Year - Redundant columns representing the year.
Unit - Represents the quantity. Here it is common being Head.
Value - Assigns the numerical value of the livestock. It varies across the data.
Flag and Flag Description - It is a symbol and category combination. Such as blank for official data, F for FAO estimate, * for Unofficial Figure, etc.
head(egg)
## # A tibble: 6 × 14
## `Domain Code` Domain `Area Code` Area `Element Code` Element `Item Code`
## <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
## 1 QL Livestock … 2 Afgh… 5313 Laying 1062
## 2 QL Livestock … 2 Afgh… 5410 Yield 1062
## 3 QL Livestock … 2 Afgh… 5510 Produc… 1062
## 4 QL Livestock … 2 Afgh… 5313 Laying 1062
## 5 QL Livestock … 2 Afgh… 5410 Yield 1062
## 6 QL Livestock … 2 Afgh… 5510 Produc… 1062
## # ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
## # Value <dbl>, Flag <chr>, `Flag Description` <chr>
While most attributes are same as the livestock data, the description slightly differs.
Domain - Describes the relvant domain of the chicken. Here, it is common for all rows.
Element - Shows the element for each chicken and egg.
Item - Represents the category of the chicken and eggs. It is common for all.
Unit - Represents the quantity. It varies among the rows.
Value - Assigns the numerical value of the product. It varies across the data.
The dimensions of the data can be obtained by-
dim(livestock)
## [1] 82116 14
dim(egg)
## [1] 38170 14
We see that the livestock dataset is twice was big as the egg and chicken data.
We can look at the different columns -
colnames(livestock)
## [1] "Domain Code" "Domain" "Area Code" "Area"
## [5] "Element Code" "Element" "Item Code" "Item"
## [9] "Year Code" "Year" "Unit" "Value"
## [13] "Flag" "Flag Description"
colnames(egg)
## [1] "Domain Code" "Domain" "Area Code" "Area"
## [5] "Element Code" "Element" "Item Code" "Item"
## [9] "Year Code" "Year" "Unit" "Value"
## [13] "Flag" "Flag Description"
We can have a look at specific columns of the data using the select function.
select(livestock, "Area")
## # A tibble: 82,116 × 1
## Area
## <chr>
## 1 Afghanistan
## 2 Afghanistan
## 3 Afghanistan
## 4 Afghanistan
## 5 Afghanistan
## 6 Afghanistan
## 7 Afghanistan
## 8 Afghanistan
## 9 Afghanistan
## 10 Afghanistan
## # ℹ 82,106 more rows
select(egg, "Flag")
## # A tibble: 38,170 × 1
## Flag
## <chr>
## 1 F
## 2 Fc
## 3 F
## 4 F
## 5 Fc
## 6 F
## 7 F
## 8 Fc
## 9 F
## 10 F
## # ℹ 38,160 more rows
Multiple columns can also be selected-
select(livestock, "Area", "Item", "Value")
## # A tibble: 82,116 × 3
## Area Item Value
## <chr> <chr> <dbl>
## 1 Afghanistan Asses 1300000
## 2 Afghanistan Asses 851850
## 3 Afghanistan Asses 1001112
## 4 Afghanistan Asses 1150000
## 5 Afghanistan Asses 1300000
## 6 Afghanistan Asses 1200000
## 7 Afghanistan Asses 1200000
## 8 Afghanistan Asses 1328000
## 9 Afghanistan Asses 1250000
## 10 Afghanistan Asses 1300000
## # ℹ 82,106 more rows
select(egg, "Element", "Item", "Year")
## # A tibble: 38,170 × 3
## Element Item Year
## <chr> <chr> <dbl>
## 1 Laying Eggs, hen, in shell 1961
## 2 Yield Eggs, hen, in shell 1961
## 3 Production Eggs, hen, in shell 1961
## 4 Laying Eggs, hen, in shell 1962
## 5 Yield Eggs, hen, in shell 1962
## 6 Production Eggs, hen, in shell 1962
## 7 Laying Eggs, hen, in shell 1963
## 8 Yield Eggs, hen, in shell 1963
## 9 Production Eggs, hen, in shell 1963
## 10 Laying Eggs, hen, in shell 1964
## # ℹ 38,160 more rows
We see that elements can be either of Laying, Yield, or Production. This likely shows the purpose of raising chickens and eggs.
Tables can be used for checking the distribution of values within a column. For this, we can first retrieve the particular column.
Livestock_item <- select(livestock, Item)
head(Livestock_item)
## # A tibble: 6 × 1
## Item
## <chr>
## 1 Asses
## 2 Asses
## 3 Asses
## 4 Asses
## 5 Asses
## 6 Asses
table(Livestock_item)
## Item
## Asses Buffaloes Camels Cattle Goats Horses Mules Pigs
## 8571 3505 3265 13086 12498 11104 6153 12015
## Sheep
## 11919
From above we get the distribution of the livestock type from the data. We see that cattle appears the highest and Camels the lowest.
We can also get the proportion of each value with the help of proportion tables.
prop.table(table(Livestock_item))
## Item
## Asses Buffaloes Camels Cattle Goats Horses Mules
## 0.10437674 0.04268352 0.03976083 0.15935993 0.15219933 0.13522334 0.07493059
## Pigs Sheep
## 0.14631740 0.14514833
If we want to look at specific portions of data then we can use the filter function. For instance, if we want to see the data from Cuba then we can do the following-
filter(egg, `Area` == "Cuba")
## # A tibble: 174 × 14
## `Domain Code` Domain `Area Code` Area `Element Code` Element `Item Code`
## <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
## 1 QL Livestock… 49 Cuba 5313 Laying 1062
## 2 QL Livestock… 49 Cuba 5410 Yield 1062
## 3 QL Livestock… 49 Cuba 5510 Produc… 1062
## 4 QL Livestock… 49 Cuba 5313 Laying 1062
## 5 QL Livestock… 49 Cuba 5410 Yield 1062
## 6 QL Livestock… 49 Cuba 5510 Produc… 1062
## 7 QL Livestock… 49 Cuba 5313 Laying 1062
## 8 QL Livestock… 49 Cuba 5410 Yield 1062
## 9 QL Livestock… 49 Cuba 5510 Produc… 1062
## 10 QL Livestock… 49 Cuba 5313 Laying 1062
## # ℹ 164 more rows
## # ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
## # Value <dbl>, Flag <chr>, `Flag Description` <chr>
If we want data from 2000 and recent then we can use the filter function for that as well.
filter(egg, `Year` >= 2000)
## # A tibble: 13,406 × 14
## `Domain Code` Domain `Area Code` Area `Element Code` Element `Item Code`
## <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
## 1 QL Livestock… 2 Afgh… 5313 Laying 1062
## 2 QL Livestock… 2 Afgh… 5410 Yield 1062
## 3 QL Livestock… 2 Afgh… 5510 Produc… 1062
## 4 QL Livestock… 2 Afgh… 5313 Laying 1062
## 5 QL Livestock… 2 Afgh… 5410 Yield 1062
## 6 QL Livestock… 2 Afgh… 5510 Produc… 1062
## 7 QL Livestock… 2 Afgh… 5313 Laying 1062
## 8 QL Livestock… 2 Afgh… 5410 Yield 1062
## 9 QL Livestock… 2 Afgh… 5510 Produc… 1062
## 10 QL Livestock… 2 Afgh… 5313 Laying 1062
## # ℹ 13,396 more rows
## # ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
## # Value <dbl>, Flag <chr>, `Flag Description` <chr>
We can use the group_by function with summarise to get the total value from every area in the data. It will require two steps. First step will be to group the data by the area. Second step involves calculating the aggregate value for each area.
area_data <- egg %>% group_by(Area)
area_data
## # A tibble: 38,170 × 14
## # Groups: Area [245]
## `Domain Code` Domain `Area Code` Area `Element Code` Element `Item Code`
## <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
## 1 QL Livestock… 2 Afgh… 5313 Laying 1062
## 2 QL Livestock… 2 Afgh… 5410 Yield 1062
## 3 QL Livestock… 2 Afgh… 5510 Produc… 1062
## 4 QL Livestock… 2 Afgh… 5313 Laying 1062
## 5 QL Livestock… 2 Afgh… 5410 Yield 1062
## 6 QL Livestock… 2 Afgh… 5510 Produc… 1062
## 7 QL Livestock… 2 Afgh… 5313 Laying 1062
## 8 QL Livestock… 2 Afgh… 5410 Yield 1062
## 9 QL Livestock… 2 Afgh… 5510 Produc… 1062
## 10 QL Livestock… 2 Afgh… 5313 Laying 1062
## # ℹ 38,160 more rows
## # ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
## # Value <dbl>, Flag <chr>, `Flag Description` <chr>
agg_val <- area_data %>% summarize(total_value = sum(Value, na.rm = TRUE))
agg_val
## # A tibble: 245 × 2
## Area total_value
## <chr> <dbl>
## 1 Afghanistan 2645330
## 2 Africa 112454830
## 3 Albania 5363511
## 4 Algeria 11134303
## 5 American Samoa 2300997
## 6 Americas 569598764
## 7 Angola 2488490
## 8 Antigua and Barbuda 2880269
## 9 Argentina 27251130
## 10 Armenia 3546073
## # ℹ 235 more rows
From above, we see the total sum of values for each area from the data. We can arrange this data in increasing order to see the areas with the highest values.
desc_agg_val <- agg_val %>% arrange(desc(total_value))
desc_agg_val
## # A tibble: 245 × 2
## Area total_value
## <chr> <dbl>
## 1 World 2532519493
## 2 Asia 1227360081
## 3 Eastern Asia 876570254
## 4 China, mainland 700621017
## 5 Europe 623448480
## 6 Americas 569598764
## 7 Northern America 313640726
## 8 United States of America 291302872
## 9 Eastern Europe 284453965
## 10 Western Europe 172362145
## # ℹ 235 more rows
This gives the obvious result that World has highest value followed by the Continents and some highest populated countries like China and India.
Same can be done to get the areas with least value.
inc_agg_data <- agg_val %>% arrange(total_value)
inc_agg_data
## # A tibble: 245 × 2
## Area total_value
## <chr> <dbl>
## 1 South Sudan 0
## 2 United States Virgin Islands 87585
## 3 Micronesia (Federated States of) 640138
## 4 Pacific Islands Trust Territory 695616
## 5 Barbados 734511
## 6 Sudan 853586
## 7 Saint Pierre and Miquelon 953768
## 8 Liechtenstein 1103525
## 9 Montenegro 1109816
## 10 Eritrea 1123776
## # ℹ 235 more rows
We can check for missing values.
colSums(is.na(livestock))
## Domain Code Domain Area Code Area
## 0 0 0 0
## Element Code Element Item Code Item
## 0 0 0 0
## Year Code Year Unit Value
## 0 0 0 1301
## Flag Flag Description
## 38270 0
From the data, we see that the empty cell in Flag corresponds to “official data” in Flag Description so it cannot be treated as missing value.
We can fix the missing data in Value by replacing with the mean.
livestock$Value[is.na(livestock$Value)] <- mean(livestock$Value, na.rm = TRUE)
Now we check in the egg dataset.
colSums(is.na(egg))
## Domain Code Domain Area Code Area
## 0 0 0 0
## Element Code Element Item Code Item
## 0 0 0 0
## Year Code Year Unit Value
## 0 0 0 40
## Flag Flag Description
## 7548 0
Like before, Flag column is an exception but we can fix the missing data in Value.
egg$Value[is.na(egg$Value)] <- mean(egg$Value, na.rm = TRUE)
We now check if it fixed the problem.
colSums(is.na(livestock))
## Domain Code Domain Area Code Area
## 0 0 0 0
## Element Code Element Item Code Item
## 0 0 0 0
## Year Code Year Unit Value
## 0 0 0 0
## Flag Flag Description
## 38270 0
colSums(is.na(egg))
## Domain Code Domain Area Code Area
## 0 0 0 0
## Element Code Element Item Code Item
## 0 0 0 0
## Year Code Year Unit Value
## 0 0 0 0
## Flag Flag Description
## 7548 0
We see that it fixed all the missing values.
While the values present clear numerical data, it can be hard to compare such large numbers. We can scale them accordingly that can be helped in comparison analysis across various years and quantity.
livestock <- mutate(livestock, Value_Comparison = scale(Value))
head(livestock)
## # A tibble: 6 × 15
## `Domain Code` Domain `Area Code` Area `Element Code` Element `Item Code`
## <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
## 1 QA Live Anima… 2 Afgh… 5111 Stocks 1107
## 2 QA Live Anima… 2 Afgh… 5111 Stocks 1107
## 3 QA Live Anima… 2 Afgh… 5111 Stocks 1107
## 4 QA Live Anima… 2 Afgh… 5111 Stocks 1107
## 5 QA Live Anima… 2 Afgh… 5111 Stocks 1107
## 6 QA Live Anima… 2 Afgh… 5111 Stocks 1107
## # ℹ 8 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
## # Value <dbl>, Flag <chr>, `Flag Description` <chr>,
## # Value_Comparison <dbl[,1]>
egg <- mutate(egg, Value_Comparison = scale(Value))
head(egg)
## # A tibble: 6 × 15
## `Domain Code` Domain `Area Code` Area `Element Code` Element `Item Code`
## <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
## 1 QL Livestock … 2 Afgh… 5313 Laying 1062
## 2 QL Livestock … 2 Afgh… 5410 Yield 1062
## 3 QL Livestock … 2 Afgh… 5510 Produc… 1062
## 4 QL Livestock … 2 Afgh… 5313 Laying 1062
## 5 QL Livestock … 2 Afgh… 5410 Yield 1062
## 6 QL Livestock … 2 Afgh… 5510 Produc… 1062
## # ℹ 8 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
## # Value <dbl>, Flag <chr>, `Flag Description` <chr>,
## # Value_Comparison <dbl[,1]>
We see that adding the new column was successful.
We can perform an inner join by mutual columns for which we have plenty of options in these datasets.
joined_data <- inner_join(livestock, egg, by = c("Area", "Year"))
## Warning in inner_join(livestock, egg, by = c("Area", "Year")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1 of `x` matches multiple rows in `y`.
## ℹ Row 1 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
head(joined_data)
## # A tibble: 6 × 28
## `Domain Code.x` Domain.x `Area Code.x` Area `Element Code.x` Element.x
## <chr> <chr> <dbl> <chr> <dbl> <chr>
## 1 QA Live Animals 2 Afghani… 5111 Stocks
## 2 QA Live Animals 2 Afghani… 5111 Stocks
## 3 QA Live Animals 2 Afghani… 5111 Stocks
## 4 QA Live Animals 2 Afghani… 5111 Stocks
## 5 QA Live Animals 2 Afghani… 5111 Stocks
## 6 QA Live Animals 2 Afghani… 5111 Stocks
## # ℹ 22 more variables: `Item Code.x` <dbl>, Item.x <chr>, `Year Code.x` <dbl>,
## # Year <dbl>, Unit.x <chr>, Value.x <dbl>, Flag.x <chr>,
## # `Flag Description.x` <chr>, Value_Comparison.x <dbl[,1]>,
## # `Domain Code.y` <chr>, Domain.y <chr>, `Area Code.y` <dbl>,
## # `Element Code.y` <dbl>, Element.y <chr>, `Item Code.y` <dbl>, Item.y <chr>,
## # `Year Code.y` <dbl>, Unit.y <chr>, Value.y <dbl>, Flag.y <chr>,
## # `Flag Description.y` <chr>, Value_Comparison.y <dbl[,1]>
We see that the join was successful. We can now analyze certain aspects of this new dataset.
We first get the summary.
summary(joined_data)
## Domain Code.x Domain.x Area Code.x Area
## Length:236827 Length:236827 Min. : 1.0 Length:236827
## Class :character Class :character 1st Qu.: 74.0 Class :character
## Mode :character Mode :character Median : 146.0 Mode :character
## Mean : 943.5
## 3rd Qu.: 221.0
## Max. :5504.0
## Element Code.x Element.x Item Code.x Item.x
## Min. :5111 Length:236827 Min. : 866 Length:236827
## 1st Qu.:5111 Class :character 1st Qu.: 976 Class :character
## Median :5111 Mode :character Median :1034 Mode :character
## Mean :5111 Mean :1018
## 3rd Qu.:5111 3rd Qu.:1096
## Max. :5111 Max. :1126
## Year Code.x Year Unit.x Value.x
## Min. :1961 Min. :1961 Length:236827 Min. :0.000e+00
## 1st Qu.:1976 1st Qu.:1976 Class :character 1st Qu.:1.500e+04
## Median :1991 Median :1991 Mode :character Median :2.670e+05
## Mean :1991 Mean :1991 Mean :1.201e+07
## 3rd Qu.:2005 3rd Qu.:2005 3rd Qu.:2.760e+06
## Max. :2018 Max. :2018 Max. :1.490e+09
## Flag.x Flag Description.x Value_Comparison.x.V1 Domain Code.y
## Length:236827 Length:236827 Min. :-0.180902 Length:236827
## Class :character Class :character 1st Qu.:-0.180668 Class :character
## Mode :character Mode :character Median :-0.176747 Mode :character
## Mean : 0.005905
## 3rd Qu.:-0.137957
## Max. :23.000527
## Domain.y Area Code.y Element Code.y Element.y
## Length:236827 Min. : 1.0 Min. :5313 Length:236827
## Class :character 1st Qu.: 74.0 1st Qu.:5313 Class :character
## Mode :character Median : 146.0 Median :5410 Mode :character
## Mean : 943.5 Mean :5411
## 3rd Qu.: 221.0 3rd Qu.:5510
## Max. :5504.0 Max. :5510
## Item Code.y Item.y Year Code.y Unit.y
## Min. :1062 Length:236827 Min. :1961 Length:236827
## 1st Qu.:1062 Class :character 1st Qu.:1976 Class :character
## Median :1062 Mode :character Median :1991 Mode :character
## Mean :1062 Mean :1991
## 3rd Qu.:1062 3rd Qu.:2005
## Max. :1062 Max. :2018
## Value.y Flag.y Flag Description.y Value_Comparison.y.V1
## Min. : 1 Length:236827 Length:236827 Min. :-0.13055
## 1st Qu.: 4000 Class :character Class :character 1st Qu.:-0.12876
## Median : 39582 Mode :character Mode :character Median :-0.11282
## Mean : 386588 Mean : 0.04268
## 3rd Qu.: 106390 3rd Qu.:-0.08288
## Max. :76769955 Max. :34.27090
We now compare the means.
group_by(joined_data, Area) %>%
summarize(mean_val_livestock = mean(Value.x), mean_val_egg = mean(Value.y))
## # A tibble: 244 × 3
## Area mean_val_livestock mean_val_egg
## <chr> <dbl> <dbl>
## 1 Afghanistan 3597216. 15203.
## 2 Africa 78159910. 646292.
## 3 Albania 532013. 30825.
## 4 Algeria 2575444. 63990.
## 5 American Samoa 5399. 13224.
## 6 Americas 95795716. 3273556.
## 7 Angola 1123117. 14302.
## 8 Antigua and Barbuda 7877. 16553.
## 9 Argentina 12879607. 156616.
## 10 Armenia 167587. 43779.
## # ℹ 234 more rows
We can comparison the scaled value columns that we created.
summary(joined_data$Value_Comparison.x)
## V1
## Min. :-0.180902
## 1st Qu.:-0.180668
## Median :-0.176747
## Mean : 0.005905
## 3rd Qu.:-0.137957
## Max. :23.000527
summary(joined_data$Value_Comparison.y)
## V1
## Min. :-0.13055
## 1st Qu.:-0.12876
## Median :-0.11282
## Mean : 0.04268
## 3rd Qu.:-0.08288
## Max. :34.27090
We first did the descriptive analysis, tidying, and mutating the data. Then we joined these similar datasets and performed some basic operations on the newly joined data.