For this challenge, I will be reading in the FAOSTAT egg chicken dataset to keep on theme with the birds from the last challenge.
The dataset and the first rows are displayed below.
library(tidyverse)  # provides read_csv() and the dplyr verbs used below
egg_data <- read_csv("../challenge_datasets/FAOSTAT_egg_chicken.csv")
## Rows: 38170 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Domain Code, Domain, Area, Element, Item, Unit, Flag, Flag Description
## dbl (6): Area Code, Element Code, Item Code, Year Code, Year, Value
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Like the birds.csv dataset from the last challenge, this data appears to be collected by the FAO, as hinted at by the "FAO estimate" flag description. Within each country the rows run in increasing year order, and the data measures egg production in each of those countries. Overall, it records the number of hens, the yield per hen, and the total production for a variety of countries.
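One quick way to confirm which measurements a dataset like this contains is to count the distinct Element and Unit pairs. The sketch below runs on a small made-up tibble (`toy_eggs` and its values are invented for illustration, not taken from the real file); on the real data the same `count()` call would be applied to `egg_data`.

```r
library(dplyr)
library(tibble)

# Invented stand-in for egg_data; element and unit labels are illustrative only.
toy_eggs <- tibble(
  Element = c("Laying", "Laying", "Yield", "Production"),
  Unit    = c("1000 Head", "1000 Head", "100mg/An", "tonnes")
)

# One row per Element/Unit pair, with the number of records for each pair.
element_units <- toy_eggs %>%
  count(Element, Unit)

print(element_units)
```

If a single Element shows up with more than one Unit, that is a sign the values should not be summarized together.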
# Keep only the rows measured in tonnes
tonnes_data <- egg_data %>% filter(Unit == "tonnes")
# Mean production (tonnes) per area across all years
mean_tonnes_data <- tonnes_data %>%
  group_by(Area) %>%
  summarize(mean_yield = mean(Value))
# Only "World" is dropped here, so regional aggregates such as "Asia" remain
top_countries <- mean_tonnes_data %>%
  filter(Area != "World") %>%
  arrange(desc(mean_yield)) %>%
  head(5)
print(top_countries)
## # A tibble: 5 × 2
## Area mean_yield
## <chr> <dbl>
## 1 Asia 18896761.
## 2 Eastern Asia 13566855.
## 3 China, mainland 10744941.
## 4 Europe 9783322.
## 5 Americas 8943798.
bottom_countries <- mean_tonnes_data %>%
  arrange(mean_yield) %>%
  head(5)
print(bottom_countries)
## # A tibble: 5 × 2
## Area mean_yield
## <chr> <dbl>
## 1 Tokelau 6.84
## 2 Tuvalu 13.6
## 3 Saint Pierre and Miquelon 13.6
## 4 Nauru 17.2
## 5 Niue 19.9
I observed that the Unit column contains several different categories, so for any meaningful analysis we must filter to one unit at a time. In this case, I narrowed the scope down to just the rows labeled tonnes.
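We can observe the top 5 and bottom 5 areas/countries by mean egg production (in tonnes) across the years. As expected, areas with more land mostly tend to have higher egg production. Note that four of the top five (Asia, Eastern Asia, Europe, and Americas) are regional aggregates rather than individual countries, because only "World" was filtered out.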
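To rank individual countries only, the regional aggregates would need to be removed as well. The sketch below assumes that aggregate areas carry an Area Code of 5000 or above; that convention is my understanding of FAOSTAT coding and should be checked against the real file before relying on it. The tibble `toy` and all of its values are invented for illustration.

```r
library(dplyr)
library(tibble)

# Invented stand-in for egg_data; codes and values are illustrative only.
toy <- tibble(
  Area        = c("Asia", "Asia", "India", "India", "Tokelau", "Tokelau"),
  `Area Code` = c(5300, 5300, 100, 100, 218, 218),
  Unit        = "tonnes",
  Value       = c(100, 300, 40, 60, 1, 3)
)

# Assumption: Area Code >= 5000 marks a regional aggregate, not a country.
country_means <- toy %>%
  filter(Unit == "tonnes", `Area Code` < 5000) %>%
  group_by(Area) %>%
  summarize(mean_production = mean(Value, na.rm = TRUE)) %>%
  arrange(desc(mean_production))

print(country_means)
```

Filtering on the code rather than on a hand-written list of names avoids missing sub-regions such as "Eastern Asia", provided the code convention holds for this dataset.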
egg_tonnes <- egg_data %>%
  filter(!is.na(Value)) %>%
  filter(Unit == "tonnes")
egg_tonnes %>%
  group_by(Area) %>%
  summarize(
    mean = mean(Value),
    median = median(Value),
    sd = sd(Value),
    min = min(Value),
    max = max(Value),
    q25 = quantile(Value, 0.25),
    q75 = quantile(Value, 0.75)
  )
## # A tibble: 244 × 8
## Area mean median sd min max q25 q75
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 15264. 14325 2589. 1 e4 2.24e4 1.36e4 1.68e4
## 2 Africa 1580155. 1470368 919353. 3.92e5 3.31e6 7.59e5 2.24e6
## 3 Albania 17902. 13430 14941. 2.74e3 5.29e4 6.90e3 2.66e4
## 4 Algeria 114104. 94000 113206. 7.5 e3 3.90e5 1.58e4 1.73e5
## 5 American Samoa 31.5 32 5.02 1.8 e1 4.5 e1 3 e1 3.5 e1
## 6 Americas 8943798. 7986902 3265157. 4.85e6 1.63e7 5.95e6 1.13e7
## 7 Angola 3968. 3900 826. 2.38e3 5.25e3 3.41e3 4.67e3
## 8 Antigua and Barbuda 186. 170 68.0 9.5 e1 3 e2 1.26e2 2.48e2
## 9 Argentina 327012. 275302. 177930. 1.41e5 8.29e5 2.03e5 3.49e5
## 10 Armenia 25861. 28784 9911. 1.04e4 3.82e4 1.60e4 3.47e4
## # ℹ 234 more rows
With this summary we can observe a lot of variation yet again! The standard deviation is generally very high for areas with high production. Dividing the standard deviation by the mean production (the coefficient of variation) would make the spread comparable across areas of very different sizes. There is also a vast gap between the max and min values across areas, which shows how strongly the larger countries and regions dominate egg production.
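The idea of dividing the standard deviation by the mean can be sketched as follows. The data here is a made-up tibble (`toy_tonnes`, with invented areas "Big" and "Small"), so the numbers say nothing about the real dataset; on the real data the same pipeline would start from `egg_tonnes`.

```r
library(dplyr)
library(tibble)

# Invented values: one large producer and one small producer.
toy_tonnes <- tibble(
  Area  = rep(c("Big", "Small"), each = 3),
  Value = c(900, 1000, 1100, 5, 10, 15)
)

# Coefficient of variation = standard deviation relative to the mean.
cv_by_area <- toy_tonnes %>%
  group_by(Area) %>%
  summarize(
    mean_value = mean(Value),
    sd_value   = sd(Value),
    cv         = sd_value / mean_value
  ) %>%
  arrange(desc(cv))

print(cv_by_area)
```

In this toy case "Big" has the larger standard deviation (100 versus 5), but "Small" has the larger coefficient of variation (0.5 versus 0.1), which is exactly the kind of size effect the raw standard deviation hides.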