This challenge will focus on first reading the data set and giving a description of the data with the use of code chunks.
The next step is to then provide grouped summary statistics with the use of dplyr commands and other mathematical terms.
The dataset used for this challenge is the FAO state dataset about livestock. We start by importing the necessary libraries.
library(readr)
library(here)
## here() starts at C:/Users/SHAURYA/Desktop/Studies/Winter 2024 601/Challenges/challenge 2
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
We then load the dataset for the use.
fao_csv <- read_csv("FAOSTAT_livestock.csv", show_col_types = FALSE)
We can have a look at the data.
head(fao_csv)
## # A tibble: 6 × 14
## `Domain Code` Domain `Area Code` Area `Element Code` Element `Item Code`
## <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
## 1 QA Live Anima… 2 Afgh… 5111 Stocks 1107
## 2 QA Live Anima… 2 Afgh… 5111 Stocks 1107
## 3 QA Live Anima… 2 Afgh… 5111 Stocks 1107
## 4 QA Live Anima… 2 Afgh… 5111 Stocks 1107
## 5 QA Live Anima… 2 Afgh… 5111 Stocks 1107
## 6 QA Live Anima… 2 Afgh… 5111 Stocks 1107
## # ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
## # Value <dbl>, Flag <chr>, `Flag Description` <chr>
We see a variety of columns such as Domain, Area, Year, Item, and Value. It has a mix of both numerical and categorical data which is useful for further analysis.
An explanation of data in each column can be given-
Domain Code - Represents the national website domain of a country. In this case, it is Qatar.
Domain - Describes the relvant domain of the livestock. Here, all the animals are in the Live Animals domain.
Area Code - Represents the code assigned to each country.
Area - Represents the country.
Element Code - Numerical value assigned to the element.
Element - Shows the element for each animal. Here, it is common for all being stocks.
Item Code - Code assigned to every item.
Item - Represents the category of the livestock such as Asses, Buffaloes, Pigs, etc.
Year Code and Year - Redundant columns representing the year.
Unit - Represents the quantity. Here it is common being Head.
Value - Assigns the numerical value of the livestock. It varies across the data.
Flag and Flag Description - It is a symbol and category combination. Such as blank for official data, F for FAO estimate, * for Unofficial Figure, etc.
The dimensions of the data can be obtained by-
dim(fao_csv)
## [1] 82116 14
We see that there are a total of 82116 rows and 14 columns.
We can have a look at all the attributes of the dataset.
colnames(fao_csv)
## [1] "Domain Code" "Domain" "Area Code" "Area"
## [5] "Element Code" "Element" "Item Code" "Item"
## [9] "Year Code" "Year" "Unit" "Value"
## [13] "Flag" "Flag Description"
We can have a look at specific columns of the data using the select function.
select(fao_csv, "Area")
## # A tibble: 82,116 × 1
## Area
## <chr>
## 1 Afghanistan
## 2 Afghanistan
## 3 Afghanistan
## 4 Afghanistan
## 5 Afghanistan
## 6 Afghanistan
## 7 Afghanistan
## 8 Afghanistan
## 9 Afghanistan
## 10 Afghanistan
## # ℹ 82,106 more rows
Multiple columns can also be selected-
select(fao_csv, "Area", "Item", "Value")
## # A tibble: 82,116 × 3
## Area Item Value
## <chr> <chr> <dbl>
## 1 Afghanistan Asses 1300000
## 2 Afghanistan Asses 851850
## 3 Afghanistan Asses 1001112
## 4 Afghanistan Asses 1150000
## 5 Afghanistan Asses 1300000
## 6 Afghanistan Asses 1200000
## 7 Afghanistan Asses 1200000
## 8 Afghanistan Asses 1328000
## 9 Afghanistan Asses 1250000
## 10 Afghanistan Asses 1300000
## # ℹ 82,106 more rows
We see that the often the same item from one area has different prices. A big factor into this is the Year and the Source of data.
Tables can be used for checking the distribution of values within a column. For this, we can first retrieve the particular column.
Livestock_item <- select(fao_csv, Item)
head(Livestock_item)
## # A tibble: 6 × 1
## Item
## <chr>
## 1 Asses
## 2 Asses
## 3 Asses
## 4 Asses
## 5 Asses
## 6 Asses
table(Livestock_item)
## Item
## Asses Buffaloes Camels Cattle Goats Horses Mules Pigs
## 8571 3505 3265 13086 12498 11104 6153 12015
## Sheep
## 11919
From above we get the distribution of the livestock type from the data. We see that cattle appears the highest and Camels the lowest.
We can also get the proportion of each value with the help of proportion tables.
prop.table(table(Livestock_item))
## Item
## Asses Buffaloes Camels Cattle Goats Horses Mules
## 0.10437674 0.04268352 0.03976083 0.15935993 0.15219933 0.13522334 0.07493059
## Pigs Sheep
## 0.14631740 0.14514833
If we want to look at specific portions of data then we can use the filter function. For instance, if we want to see the data from Cuba then we can do the following-
filter(fao_csv, `Area` == "Cuba")
## # A tibble: 406 × 14
## `Domain Code` Domain `Area Code` Area `Element Code` Element `Item Code`
## <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
## 1 QA Live Anim… 49 Cuba 5111 Stocks 1107
## 2 QA Live Anim… 49 Cuba 5111 Stocks 1107
## 3 QA Live Anim… 49 Cuba 5111 Stocks 1107
## 4 QA Live Anim… 49 Cuba 5111 Stocks 1107
## 5 QA Live Anim… 49 Cuba 5111 Stocks 1107
## 6 QA Live Anim… 49 Cuba 5111 Stocks 1107
## 7 QA Live Anim… 49 Cuba 5111 Stocks 1107
## 8 QA Live Anim… 49 Cuba 5111 Stocks 1107
## 9 QA Live Anim… 49 Cuba 5111 Stocks 1107
## 10 QA Live Anim… 49 Cuba 5111 Stocks 1107
## # ℹ 396 more rows
## # ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
## # Value <dbl>, Flag <chr>, `Flag Description` <chr>
If we want data from 2000 and recent then we can use the filter function for that as well.
filter(fao_csv, `Year` >= 2000)
## # A tibble: 28,639 × 14
## `Domain Code` Domain `Area Code` Area `Element Code` Element `Item Code`
## <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
## 1 QA Live Anim… 2 Afgh… 5111 Stocks 1107
## 2 QA Live Anim… 2 Afgh… 5111 Stocks 1107
## 3 QA Live Anim… 2 Afgh… 5111 Stocks 1107
## 4 QA Live Anim… 2 Afgh… 5111 Stocks 1107
## 5 QA Live Anim… 2 Afgh… 5111 Stocks 1107
## 6 QA Live Anim… 2 Afgh… 5111 Stocks 1107
## 7 QA Live Anim… 2 Afgh… 5111 Stocks 1107
## 8 QA Live Anim… 2 Afgh… 5111 Stocks 1107
## 9 QA Live Anim… 2 Afgh… 5111 Stocks 1107
## 10 QA Live Anim… 2 Afgh… 5111 Stocks 1107
## # ℹ 28,629 more rows
## # ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
## # Value <dbl>, Flag <chr>, `Flag Description` <chr>
We can use the group_by function with summarise to get the total value from every area in the data. It will require two steps. First step will be to group the data by the area. Second step involves calculating the aggregate value for each area.
area_data <- fao_csv %>% group_by(Area)
area_data
## # A tibble: 82,116 × 14
## # Groups: Area [253]
## `Domain Code` Domain `Area Code` Area `Element Code` Element `Item Code`
## <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
## 1 QA Live Anim… 2 Afgh… 5111 Stocks 1107
## 2 QA Live Anim… 2 Afgh… 5111 Stocks 1107
## 3 QA Live Anim… 2 Afgh… 5111 Stocks 1107
## 4 QA Live Anim… 2 Afgh… 5111 Stocks 1107
## 5 QA Live Anim… 2 Afgh… 5111 Stocks 1107
## 6 QA Live Anim… 2 Afgh… 5111 Stocks 1107
## 7 QA Live Anim… 2 Afgh… 5111 Stocks 1107
## 8 QA Live Anim… 2 Afgh… 5111 Stocks 1107
## 9 QA Live Anim… 2 Afgh… 5111 Stocks 1107
## 10 QA Live Anim… 2 Afgh… 5111 Stocks 1107
## # ℹ 82,106 more rows
## # ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
## # Value <dbl>, Flag <chr>, `Flag Description` <chr>
agg_val <- area_data %>% summarize(total_value = sum(Value, na.rm = TRUE))
agg_val
## # A tibble: 253 × 2
## Area total_value
## <chr> <dbl>
## 1 Afghanistan 1460469747
## 2 Africa 40799473033
## 3 Albania 200351683
## 4 Algeria 1195006104
## 5 American Samoa 626333
## 6 Americas 44449212359
## 7 Angola 390844790
## 8 Antigua and Barbuda 2741177
## 9 Argentina 5229120240
## 10 Armenia 36198797
## # ℹ 243 more rows
From above, we see the total sum of values for each area from the data. We can arrange this data in increasing order to see the areas with the highest values.
desc_agg_val <- agg_val %>% arrange(desc(total_value))
desc_agg_val
## # A tibble: 253 × 2
## Area total_value
## <chr> <dbl>
## 1 World 234880094237
## 2 Asia 99486671329
## 3 Americas 44449212359
## 4 Eastern Asia 41808542032
## 5 Southern Asia 41083828119
## 6 Africa 40799473033
## 7 China, mainland 38149759756
## 8 Europe 37450898501
## 9 South America 26118030796
## 10 India 25378680126
## # ℹ 243 more rows
This gives the obvious result that World has highest value followed by the Continents and some highest populated countries like China and India.
Same can be done to get the areas with least value.
inc_agg_data <- agg_val %>% arrange(total_value)
inc_agg_data
## # A tibble: 253 × 2
## Area total_value
## <chr> <dbl>
## 1 Aruba 7355
## 2 Saint Pierre and Miquelon 11759
## 3 Norfolk Island 24172
## 4 Tokelau 54391
## 5 Niue 113629
## 6 Nauru 130531
## 7 Bermuda 176050
## 8 Cayman Islands 218313
## 9 Saint Helena, Ascension and Tristan da Cunha 257386
## 10 Tuvalu 574876
## # ℹ 243 more rows
We see that the smallest areas tend to be Islands because of their low population and small land mass.
We can now get the statistical insight from the data such as mean, median, mode and dispersion such as standard deviation, min/max/quantile.
Previously, we grouped the data by Area and retrieved the aggregate value of each area. We can use the same logic to get the mean.
area <- fao_csv %>% group_by(Area)
mean_area_val <- area %>% summarize(mean_value = mean(Value, na.rm = TRUE))
mean_area_val
## # A tibble: 253 × 2
## Area mean_value
## <chr> <dbl>
## 1 Afghanistan 3597216.
## 2 Africa 78159910.
## 3 Albania 435547.
## 4 Algeria 2575444.
## 5 American Samoa 5399.
## 6 Americas 95795716.
## 7 Angola 1123117.
## 8 Antigua and Barbuda 7877.
## 9 Argentina 12879607.
## 10 Armenia 167587.
## # ℹ 243 more rows
Although not in the best form, a sizeable difference in size can be noticed among areas such as between American Samoa and Argentina. Several other comparisons can be found among the 253 unique values obtained.
We can use the same steps to get the median value -
median_val <- area %>% summarize(median_value = median(Value, na.rm = TRUE))
median_val
## # A tibble: 253 × 2
## Area median_value
## <chr> <dbl>
## 1 Afghanistan 1300000
## 2 Africa 14488952.
## 3 Albania 107000
## 4 Algeria 202150
## 5 American Samoa 3658.
## 6 Americas 32198658.
## 7 Angola 480000
## 8 Antigua and Barbuda 5000
## 9 Argentina 3595000
## 10 Armenia 13085
## # ℹ 243 more rows
For mode, we can use the Year column to see which year appears the most in the data. Because R does not have a function for it like the mean and median, we have to create our own function.
mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
most_common_year <- mode(fao_csv$Year)
most_common_year
## [1] 2012
From above we see that 2012 appears the most in the data.
We can get the standard deviation of the value for each item from the data.
all_items <- fao_csv %>% group_by(Item)
all_items
## # A tibble: 82,116 × 14
## # Groups: Item [9]
## `Domain Code` Domain `Area Code` Area `Element Code` Element `Item Code`
## <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
## 1 QA Live Anim… 2 Afgh… 5111 Stocks 1107
## 2 QA Live Anim… 2 Afgh… 5111 Stocks 1107
## 3 QA Live Anim… 2 Afgh… 5111 Stocks 1107
## 4 QA Live Anim… 2 Afgh… 5111 Stocks 1107
## 5 QA Live Anim… 2 Afgh… 5111 Stocks 1107
## 6 QA Live Anim… 2 Afgh… 5111 Stocks 1107
## 7 QA Live Anim… 2 Afgh… 5111 Stocks 1107
## 8 QA Live Anim… 2 Afgh… 5111 Stocks 1107
## 9 QA Live Anim… 2 Afgh… 5111 Stocks 1107
## 10 QA Live Anim… 2 Afgh… 5111 Stocks 1107
## # ℹ 82,106 more rows
## # ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
## # Value <dbl>, Flag <chr>, `Flag Description` <chr>
sd_value <- all_items %>% summarize(sd_value = sd(Value, na.rm = TRUE))
sd_value
## # A tibble: 9 × 2
## Item sd_value
## <chr> <dbl>
## 1 Asses 4182216.
## 2 Buffaloes 31904195.
## 3 Camels 4022813.
## 4 Cattle 99740163.
## 5 Goats 58198375.
## 6 Horses 5248621.
## 7 Mules 1555399.
## 8 Pigs 75330884.
## 9 Sheep 87975911.
We see that Cattle has the highest standard deviation value with Mules being the lowest.
For each item, we can get the min and max values -
item_values <- all_items %>% summarize(min_value = min(Value, na.rm = TRUE),
max_value = max(Value, na.rm = TRUE))
item_values
## # A tibble: 9 × 3
## Item min_value max_value
## <chr> <dbl> <dbl>
## 1 Asses 0 50960124
## 2 Buffaloes 0 206600676
## 3 Camels 42 35525270
## 4 Cattle 0 1489744504
## 5 Goats 0 1045915764
## 6 Horses 0 62161208
## 7 Mules 0 15011350
## 8 Pigs 0 991318905
## 9 Sheep 0 1209467079
Min value being 0 shows that certain times the Item has no value except for Camels whose lowest value is 42.
We can also obtain both the 25th and 75th quantiles for value of each item.
item_quantiles <- all_items %>% summarize(q_25 = quantile(Value, probs = 0.25, na.rm = TRUE),
q_75 = quantile(Value, probs = 0.75, na.rm = TRUE))
item_quantiles
## # A tibble: 9 × 3
## Item q_25 q_75
## <chr> <dbl> <dbl>
## 1 Asses 3226. 433154.
## 2 Buffaloes 5392. 2300380
## 3 Camels 18000 983811
## 4 Cattle 64609. 7458150
## 5 Goats 26602. 3095250
## 6 Horses 3000 404736
## 7 Mules 1400 135000
## 8 Pigs 26240. 2904955
## 9 Sheep 38000 8759630
There is an obvious gap between the two quantiles which shows the usual behavior for accurate numerical data.
We went through the basic high level analysis of the data followed by some in depth statistical insight on certain groups within the data. In later challenges, we can focus on various other types of functions and visualizations.