Challenge 2

Introduction

This challenge will focus on first reading the data set and giving a description of the data with the use of code chunks.

The next step is to then provide grouped summary statistics with the use of dplyr commands and other mathematical terms.

Dataset

The dataset used for this challenge is the FAO state dataset about livestock. We start by importing the necessary libraries.

library(readr)
library(here)

## here() starts at C:/Users/SHAURYA/Desktop/Studies/Winter 2024 601/Challenges/challenge 2

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

We then load the dataset for the use.

fao_csv <- read_csv("FAOSTAT_livestock.csv", show_col_types = FALSE)

Reading the Data

We can have a look at the data.

head(fao_csv)

## # A tibble: 6 × 14
##   `Domain Code` Domain      `Area Code` Area  `Element Code` Element `Item Code`
##   <chr>         <chr>             <dbl> <chr>          <dbl> <chr>         <dbl>
## 1 QA            Live Anima…           2 Afgh…           5111 Stocks         1107
## 2 QA            Live Anima…           2 Afgh…           5111 Stocks         1107
## 3 QA            Live Anima…           2 Afgh…           5111 Stocks         1107
## 4 QA            Live Anima…           2 Afgh…           5111 Stocks         1107
## 5 QA            Live Anima…           2 Afgh…           5111 Stocks         1107
## 6 QA            Live Anima…           2 Afgh…           5111 Stocks         1107
## # ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
## #   Value <dbl>, Flag <chr>, `Flag Description` <chr>

We see a variety of columns such as Domain, Area, Year, Item, and Value. It has a mix of both numerical and categorical data which is useful for further analysis.

An explanation of data in each column can be given-

Domain Code - Represents the national website domain of a country. In this case, it is Qatar.

Domain - Describes the relvant domain of the livestock. Here, all the animals are in the Live Animals domain.

Area Code - Represents the code assigned to each country.

Area - Represents the country.

Element Code - Numerical value assigned to the element.

Element - Shows the element for each animal. Here, it is common for all being stocks.

Item Code - Code assigned to every item.

Item - Represents the category of the livestock such as Asses, Buffaloes, Pigs, etc.

Year Code and Year - Redundant columns representing the year.

Unit - Represents the quantity. Here it is common being Head.

Value - Assigns the numerical value of the livestock. It varies across the data.

Flag and Flag Description - It is a symbol and category combination. Such as blank for official data, F for FAO estimate, * for Unofficial Figure, etc.

The dimensions of the data can be obtained by-

dim(fao_csv)

## [1] 82116    14

We see that there are a total of 82116 rows and 14 columns.

We can have a look at all the attributes of the dataset.

colnames(fao_csv)

##  [1] "Domain Code"      "Domain"           "Area Code"        "Area"            
##  [5] "Element Code"     "Element"          "Item Code"        "Item"            
##  [9] "Year Code"        "Year"             "Unit"             "Value"           
## [13] "Flag"             "Flag Description"

We can have a look at specific columns of the data using the select function.

select(fao_csv, "Area")

## # A tibble: 82,116 × 1
##    Area       
##    <chr>      
##  1 Afghanistan
##  2 Afghanistan
##  3 Afghanistan
##  4 Afghanistan
##  5 Afghanistan
##  6 Afghanistan
##  7 Afghanistan
##  8 Afghanistan
##  9 Afghanistan
## 10 Afghanistan
## # ℹ 82,106 more rows

Multiple columns can also be selected-

select(fao_csv, "Area", "Item", "Value")

## # A tibble: 82,116 × 3
##    Area        Item    Value
##    <chr>       <chr>   <dbl>
##  1 Afghanistan Asses 1300000
##  2 Afghanistan Asses  851850
##  3 Afghanistan Asses 1001112
##  4 Afghanistan Asses 1150000
##  5 Afghanistan Asses 1300000
##  6 Afghanistan Asses 1200000
##  7 Afghanistan Asses 1200000
##  8 Afghanistan Asses 1328000
##  9 Afghanistan Asses 1250000
## 10 Afghanistan Asses 1300000
## # ℹ 82,106 more rows

We see that the often the same item from one area has different prices. A big factor into this is the Year and the Source of data.

Tables can be used for checking the distribution of values within a column. For this, we can first retrieve the particular column.

Livestock_item <- select(fao_csv, Item)
head(Livestock_item)

## # A tibble: 6 × 1
##   Item 
##   <chr>
## 1 Asses
## 2 Asses
## 3 Asses
## 4 Asses
## 5 Asses
## 6 Asses

table(Livestock_item)

## Item
##     Asses Buffaloes    Camels    Cattle     Goats    Horses     Mules      Pigs 
##      8571      3505      3265     13086     12498     11104      6153     12015 
##     Sheep 
##     11919

From above we get the distribution of the livestock type from the data. We see that cattle appears the highest and Camels the lowest.

We can also get the proportion of each value with the help of proportion tables.

prop.table(table(Livestock_item))

## Item
##      Asses  Buffaloes     Camels     Cattle      Goats     Horses      Mules 
## 0.10437674 0.04268352 0.03976083 0.15935993 0.15219933 0.13522334 0.07493059 
##       Pigs      Sheep 
## 0.14631740 0.14514833

Transforming Data

If we want to look at specific portions of data then we can use the filter function. For instance, if we want to see the data from Cuba then we can do the following-

filter(fao_csv, `Area` == "Cuba")

## # A tibble: 406 × 14
##    `Domain Code` Domain     `Area Code` Area  `Element Code` Element `Item Code`
##    <chr>         <chr>            <dbl> <chr>          <dbl> <chr>         <dbl>
##  1 QA            Live Anim…          49 Cuba            5111 Stocks         1107
##  2 QA            Live Anim…          49 Cuba            5111 Stocks         1107
##  3 QA            Live Anim…          49 Cuba            5111 Stocks         1107
##  4 QA            Live Anim…          49 Cuba            5111 Stocks         1107
##  5 QA            Live Anim…          49 Cuba            5111 Stocks         1107
##  6 QA            Live Anim…          49 Cuba            5111 Stocks         1107
##  7 QA            Live Anim…          49 Cuba            5111 Stocks         1107
##  8 QA            Live Anim…          49 Cuba            5111 Stocks         1107
##  9 QA            Live Anim…          49 Cuba            5111 Stocks         1107
## 10 QA            Live Anim…          49 Cuba            5111 Stocks         1107
## # ℹ 396 more rows
## # ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
## #   Value <dbl>, Flag <chr>, `Flag Description` <chr>

If we want data from 2000 and recent then we can use the filter function for that as well.

filter(fao_csv, `Year` >= 2000)

## # A tibble: 28,639 × 14
##    `Domain Code` Domain     `Area Code` Area  `Element Code` Element `Item Code`
##    <chr>         <chr>            <dbl> <chr>          <dbl> <chr>         <dbl>
##  1 QA            Live Anim…           2 Afgh…           5111 Stocks         1107
##  2 QA            Live Anim…           2 Afgh…           5111 Stocks         1107
##  3 QA            Live Anim…           2 Afgh…           5111 Stocks         1107
##  4 QA            Live Anim…           2 Afgh…           5111 Stocks         1107
##  5 QA            Live Anim…           2 Afgh…           5111 Stocks         1107
##  6 QA            Live Anim…           2 Afgh…           5111 Stocks         1107
##  7 QA            Live Anim…           2 Afgh…           5111 Stocks         1107
##  8 QA            Live Anim…           2 Afgh…           5111 Stocks         1107
##  9 QA            Live Anim…           2 Afgh…           5111 Stocks         1107
## 10 QA            Live Anim…           2 Afgh…           5111 Stocks         1107
## # ℹ 28,629 more rows
## # ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
## #   Value <dbl>, Flag <chr>, `Flag Description` <chr>

We can use the group_by function with summarise to get the total value from every area in the data. It will require two steps. First step will be to group the data by the area. Second step involves calculating the aggregate value for each area.

area_data <- fao_csv %>% group_by(Area)
area_data

## # A tibble: 82,116 × 14
## # Groups:   Area [253]
##    `Domain Code` Domain     `Area Code` Area  `Element Code` Element `Item Code`
##    <chr>         <chr>            <dbl> <chr>          <dbl> <chr>         <dbl>
##  1 QA            Live Anim…           2 Afgh…           5111 Stocks         1107
##  2 QA            Live Anim…           2 Afgh…           5111 Stocks         1107
##  3 QA            Live Anim…           2 Afgh…           5111 Stocks         1107
##  4 QA            Live Anim…           2 Afgh…           5111 Stocks         1107
##  5 QA            Live Anim…           2 Afgh…           5111 Stocks         1107
##  6 QA            Live Anim…           2 Afgh…           5111 Stocks         1107
##  7 QA            Live Anim…           2 Afgh…           5111 Stocks         1107
##  8 QA            Live Anim…           2 Afgh…           5111 Stocks         1107
##  9 QA            Live Anim…           2 Afgh…           5111 Stocks         1107
## 10 QA            Live Anim…           2 Afgh…           5111 Stocks         1107
## # ℹ 82,106 more rows
## # ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
## #   Value <dbl>, Flag <chr>, `Flag Description` <chr>

agg_val <- area_data %>% summarize(total_value = sum(Value, na.rm = TRUE))
agg_val

## # A tibble: 253 × 2
##    Area                total_value
##    <chr>                     <dbl>
##  1 Afghanistan          1460469747
##  2 Africa              40799473033
##  3 Albania               200351683
##  4 Algeria              1195006104
##  5 American Samoa           626333
##  6 Americas            44449212359
##  7 Angola                390844790
##  8 Antigua and Barbuda     2741177
##  9 Argentina            5229120240
## 10 Armenia                36198797
## # ℹ 243 more rows

From above, we see the total sum of values for each area from the data. We can arrange this data in increasing order to see the areas with the highest values.

desc_agg_val <- agg_val %>% arrange(desc(total_value))
desc_agg_val

## # A tibble: 253 × 2
##    Area             total_value
##    <chr>                  <dbl>
##  1 World           234880094237
##  2 Asia             99486671329
##  3 Americas         44449212359
##  4 Eastern Asia     41808542032
##  5 Southern Asia    41083828119
##  6 Africa           40799473033
##  7 China, mainland  38149759756
##  8 Europe           37450898501
##  9 South America    26118030796
## 10 India            25378680126
## # ℹ 243 more rows

This gives the obvious result that World has highest value followed by the Continents and some highest populated countries like China and India.

Same can be done to get the areas with least value.

inc_agg_data <- agg_val %>% arrange(total_value)
inc_agg_data

## # A tibble: 253 × 2
##    Area                                         total_value
##    <chr>                                              <dbl>
##  1 Aruba                                               7355
##  2 Saint Pierre and Miquelon                          11759
##  3 Norfolk Island                                     24172
##  4 Tokelau                                            54391
##  5 Niue                                              113629
##  6 Nauru                                             130531
##  7 Bermuda                                           176050
##  8 Cayman Islands                                    218313
##  9 Saint Helena, Ascension and Tristan da Cunha      257386
## 10 Tuvalu                                            574876
## # ℹ 243 more rows

We see that the smallest areas tend to be Islands because of their low population and small land mass.

Statistical Insight

We can now get the statistical insight from the data such as mean, median, mode and dispersion such as standard deviation, min/max/quantile.

Previously, we grouped the data by Area and retrieved the aggregate value of each area. We can use the same logic to get the mean.

area <- fao_csv %>% group_by(Area)

mean_area_val <- area %>% summarize(mean_value = mean(Value, na.rm = TRUE))

mean_area_val

## # A tibble: 253 × 2
##    Area                mean_value
##    <chr>                    <dbl>
##  1 Afghanistan           3597216.
##  2 Africa               78159910.
##  3 Albania                435547.
##  4 Algeria               2575444.
##  5 American Samoa           5399.
##  6 Americas             95795716.
##  7 Angola                1123117.
##  8 Antigua and Barbuda      7877.
##  9 Argentina            12879607.
## 10 Armenia                167587.
## # ℹ 243 more rows

Although not in the best form, a sizeable difference in size can be noticed among areas such as between American Samoa and Argentina. Several other comparisons can be found among the 253 unique values obtained.

We can use the same steps to get the median value -

median_val <- area %>% summarize(median_value = median(Value, na.rm = TRUE))
median_val

## # A tibble: 253 × 2
##    Area                median_value
##    <chr>                      <dbl>
##  1 Afghanistan             1300000 
##  2 Africa                 14488952.
##  3 Albania                  107000 
##  4 Algeria                  202150 
##  5 American Samoa             3658.
##  6 Americas               32198658.
##  7 Angola                   480000 
##  8 Antigua and Barbuda        5000 
##  9 Argentina               3595000 
## 10 Armenia                   13085 
## # ℹ 243 more rows

For mode, we can use the Year column to see which year appears the most in the data. Because R does not have a function for it like the mean and median, we have to create our own function.

mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

most_common_year <- mode(fao_csv$Year)
most_common_year

## [1] 2012

From above we see that 2012 appears the most in the data.

We can get the standard deviation of the value for each item from the data.

all_items <- fao_csv %>% group_by(Item)
all_items

## # A tibble: 82,116 × 14
## # Groups:   Item [9]
##    `Domain Code` Domain     `Area Code` Area  `Element Code` Element `Item Code`
##    <chr>         <chr>            <dbl> <chr>          <dbl> <chr>         <dbl>
##  1 QA            Live Anim…           2 Afgh…           5111 Stocks         1107
##  2 QA            Live Anim…           2 Afgh…           5111 Stocks         1107
##  3 QA            Live Anim…           2 Afgh…           5111 Stocks         1107
##  4 QA            Live Anim…           2 Afgh…           5111 Stocks         1107
##  5 QA            Live Anim…           2 Afgh…           5111 Stocks         1107
##  6 QA            Live Anim…           2 Afgh…           5111 Stocks         1107
##  7 QA            Live Anim…           2 Afgh…           5111 Stocks         1107
##  8 QA            Live Anim…           2 Afgh…           5111 Stocks         1107
##  9 QA            Live Anim…           2 Afgh…           5111 Stocks         1107
## 10 QA            Live Anim…           2 Afgh…           5111 Stocks         1107
## # ℹ 82,106 more rows
## # ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
## #   Value <dbl>, Flag <chr>, `Flag Description` <chr>

sd_value <- all_items %>% summarize(sd_value = sd(Value, na.rm = TRUE))
sd_value

## # A tibble: 9 × 2
##   Item       sd_value
##   <chr>         <dbl>
## 1 Asses      4182216.
## 2 Buffaloes 31904195.
## 3 Camels     4022813.
## 4 Cattle    99740163.
## 5 Goats     58198375.
## 6 Horses     5248621.
## 7 Mules      1555399.
## 8 Pigs      75330884.
## 9 Sheep     87975911.

We see that Cattle has the highest standard deviation value with Mules being the lowest.

For each item, we can get the min and max values -

item_values <- all_items %>% summarize(min_value = min(Value, na.rm = TRUE),
                                       max_value = max(Value, na.rm = TRUE))

item_values

## # A tibble: 9 × 3
##   Item      min_value  max_value
##   <chr>         <dbl>      <dbl>
## 1 Asses             0   50960124
## 2 Buffaloes         0  206600676
## 3 Camels           42   35525270
## 4 Cattle            0 1489744504
## 5 Goats             0 1045915764
## 6 Horses            0   62161208
## 7 Mules             0   15011350
## 8 Pigs              0  991318905
## 9 Sheep             0 1209467079

Min value being 0 shows that certain times the Item has no value except for Camels whose lowest value is 42.

We can also obtain both the 25th and 75th quantiles for value of each item.

item_quantiles <- all_items %>% summarize(q_25 = quantile(Value, probs = 0.25, na.rm = TRUE),
                                         q_75 = quantile(Value, probs = 0.75, na.rm = TRUE))

item_quantiles

## # A tibble: 9 × 3
##   Item        q_25     q_75
##   <chr>      <dbl>    <dbl>
## 1 Asses      3226.  433154.
## 2 Buffaloes  5392. 2300380 
## 3 Camels    18000   983811 
## 4 Cattle    64609. 7458150 
## 5 Goats     26602. 3095250 
## 6 Horses     3000   404736 
## 7 Mules      1400   135000 
## 8 Pigs      26240. 2904955 
## 9 Sheep     38000  8759630

There is an obvious gap between the two quantiles which shows the usual behavior for accurate numerical data.

Conclusion

We went through the basic high level analysis of the data followed by some in depth statistical insight on certain groups within the data. In later challenges, we can focus on various other types of functions and visualizations.