Challenge 2

Reading in the data

I am loading the here, dplyr and readr libraries. I then use read_csv to read in the FAOSTAT cattle dairy data set and name it cattleDairy.

I then use the view and head functions to open the full file in a new tab and to preview its first rows, respectively.

cattleDairy <- read_csv("challenge_datasets/FAOSTAT_cattle_dairy.csv")
## Rows: 36449 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Domain Code, Domain, Area, Element, Item, Unit, Flag, Flag Description
## dbl (6): Area Code, Element Code, Item Code, Year Code, Year, Value
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(cattleDairy)
## # A tibble: 6 × 14
##   `Domain Code` Domain      `Area Code` Area  `Element Code` Element `Item Code`
##   <chr>         <chr>             <dbl> <chr>          <dbl> <chr>         <dbl>
## 1 QL            Livestock …           2 Afgh…           5318 Milk A…         882
## 2 QL            Livestock …           2 Afgh…           5420 Yield           882
## 3 QL            Livestock …           2 Afgh…           5510 Produc…         882
## 4 QL            Livestock …           2 Afgh…           5318 Milk A…         882
## 5 QL            Livestock …           2 Afgh…           5420 Yield           882
## 6 QL            Livestock …           2 Afgh…           5510 Produc…         882
## # ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
## #   Value <dbl>, Flag <chr>, `Flag Description` <chr>
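The message printed by read_csv suggests specifying the column types. A minimal sketch of what that could look like is below. The three-row file, its four columns and the name toy are made up for illustration, so this is not the real FAOSTAT file or the call I actually ran.

```r
library(readr)

# Hypothetical stand-in for the FAOSTAT file, written to a temporary path
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Area,Element,Year,Value",
  "Afghanistan,Milk Animals,1961,700000",
  "Afghanistan,Yield,1961,5000",
  "Afghanistan,Production,1961,350000"
), tmp)

# Declaring the types up front silences the guessing message and
# makes any unexpected column content show up as a parsing problem
toy <- read_csv(
  tmp,
  col_types = cols(
    Area    = col_character(),
    Element = col_character(),
    Year    = col_integer(),
    Value   = col_double()
  )
)
```

With the types declared, Year comes back as an integer column instead of the default double.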

Summarize by Element

Milk Animals

I would like to see the descriptive statistics for milk animals. I will pipe from the cattleDairy table, filter for only Milk Animals, select the Value column, and use the summarise_all function to compute the mean, median and standard deviation. I will use na.rm = TRUE to drop any NA values left after filtering by Milk Animals. I realize that, without fully understanding the data, this may misrepresent the descriptive statistics.

The standard deviation is very large, more than four times the mean. I am not sure whether that is an error in my code or because the data is extremely spread out.

cattleDairy %>%
  filter(Element == 'Milk Animals') %>%
  select(Value) %>% 
  summarise_all(list(mean = mean, median = median, sd = sd), na.rm = TRUE)
## # A tibble: 1 × 3
##       mean median        sd
##      <dbl>  <dbl>     <dbl>
## 1 4205410. 295000 18041595.
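As a sanity check on the large standard deviation, here is a small sketch on made-up values (not the FAOSTAT data). One very large entry among small ones is enough to push the standard deviation above the mean, so a large value by itself does not have to mean the code is wrong. The sketch also uses across(), which dplyr recommends in place of the superseded summarise_all. The names toy and toy_stats are hypothetical.

```r
library(dplyr)

# Made-up, right-skewed values: one huge entry, a few small ones, one NA
toy <- tibble(
  Element = c(rep("Milk Animals", 5), "Yield"),
  Value   = c(10, 20, 30, 1000, NA, 5)
)

toy_stats <- toy %>%
  filter(Element == "Milk Animals") %>%
  summarise(
    across(
      Value,
      list(
        mean   = ~ mean(.x, na.rm = TRUE),
        median = ~ median(.x, na.rm = TRUE),
        sd     = ~ sd(.x, na.rm = TRUE)
      ),
      .names = "{.fn}"  # name the columns mean, median, sd
    )
  )
```

In this toy example the mean is 265, the median is 25, and the standard deviation is larger than the mean, the same pattern as in the milk animals output.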

To make sure the NA values aren’t causing the data to be misrepresented, I will check by filtering by Milk Animals and then filtering again with is.na on the Value column.

This returns 37 rows. I can see in the flag description that the data is not available, so the best representation I have of the mean, median and standard deviation is above.

cattleDairy %>%
  filter(Element == 'Milk Animals') %>%
  filter(is.na(Value))
## # A tibble: 37 × 14
##    `Domain Code` Domain     `Area Code` Area  `Element Code` Element `Item Code`
##    <chr>         <chr>            <dbl> <chr>          <dbl> <chr>         <dbl>
##  1 QL            Livestock…         239 Brit…           5318 Milk A…         882
##  2 QL            Livestock…         239 Brit…           5318 Milk A…         882
##  3 QL            Livestock…         239 Brit…           5318 Milk A…         882
##  4 QL            Livestock…         239 Brit…           5318 Milk A…         882
##  5 QL            Livestock…         239 Brit…           5318 Milk A…         882
##  6 QL            Livestock…         239 Brit…           5318 Milk A…         882
##  7 QL            Livestock…         239 Brit…           5318 Milk A…         882
##  8 QL            Livestock…         239 Brit…           5318 Milk A…         882
##  9 QL            Livestock…         239 Brit…           5318 Milk A…         882
## 10 QL            Livestock…         239 Brit…           5318 Milk A…         882
## # ℹ 27 more rows
## # ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
## #   Value <dbl>, Flag <chr>, `Flag Description` <chr>
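The printed tibble hides the flag columns, so one way to confirm why the values are missing is to count the NA rows by area and flag description. This is a sketch on made-up rows; the area names and flag texts are placeholders, not values taken from the file.

```r
library(dplyr)

# Hypothetical rows: some values missing, each with a flag description
toy <- tibble(
  Area  = c("A", "A", "B", "B", "B"),
  Value = c(100, NA, NA, NA, 50),
  `Flag Description` = c(
    "Official data", "Data not available", "Data not available",
    "Data not available", "Official data"
  )
)

# Keep only the missing values, then tally them by area and flag
missing_by_flag <- toy %>%
  filter(is.na(Value)) %>%
  count(Area, `Flag Description`)
```

If every missing row carries the same flag, as in this toy example, dropping them with na.rm = TRUE is easier to justify.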

Mode attempt for milk animals

I want to preemptively state that I am unsure whether I found the mode accurately, but here is my breakdown.

I piped in the cattleDairy data, filtered for only the Milk Animals element, and then grouped by the Value column. This is where my understanding gets muddy, as I am relying on advice from the professor in the Slack channel. I understand that summarise returns one row for each value of the grouping variable (here, Value). I found that the n() function counts the number of observations in each group, and n = is the name given to that count. I cannot tell what the ungroup function does here, as the results come back the same either way. That is likely because summarise already drops the last grouping level, so with a single grouping variable there is nothing left to ungroup. I assume it is still best practice to ungroup if more functions are added.

The last step is the one I am not 100% sure about. It filters with a logical condition to keep only the rows where the n count is equal to the maximum n count. From the table, I believe the mode is 1000, with 66 rows having that value.

cattleDairy %>%
  filter(Element == 'Milk Animals') %>%
  group_by(Value) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  filter(n == max(n))
## # A tibble: 1 × 2
##   Value     n
##   <dbl> <int>
## 1  1000    66
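To check my reading of the mode pipeline, here is a sketch on a made-up vector. It shows that the result of summarise is no longer grouped when there is only one grouping variable, and it tries a shorter version with count and slice_max. This shorter version is an alternative I am assuming is equivalent, not the approach suggested in class.

```r
library(dplyr)

# Made-up values: 1000 appears three times, 250 twice, 8 once
toy <- tibble(Value = c(1000, 1000, 1000, 250, 250, 8))

counts <- toy %>%
  group_by(Value) %>%
  summarise(n = n())

# summarise() drops the last grouping level, so nothing is left to ungroup
is_grouped_df(counts)  # FALSE

toy_mode <- toy %>%
  count(Value) %>%       # same idea as group_by(Value) + summarise(n = n())
  slice_max(n, n = 1)    # keep the row(s) with the highest count, ties included
```

Here toy_mode has a single row with Value 1000 and n 3. If two values tied for the highest count, both rows would be returned.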

Production

The process is the same as for milk animals, but now filtering by Production.

cattleDairy %>%
  filter(Element == 'Production') %>%
  select(Value) %>% 
  summarise_all(list(mean = mean, median = median, sd = sd), na.rm = TRUE)
## # A tibble: 1 × 3
##       mean median        sd
##      <dbl>  <dbl>     <dbl>
## 1 9001419. 295500 40268994.

Yield

The process is the same as for milk animals, but now filtering by Yield.

cattleDairy %>%
  filter(Element == 'Yield') %>%
  select(Value) %>% 
  summarise_all(list(mean = mean, median = median, sd = sd), na.rm = TRUE)
## # A tibble: 1 × 3
##     mean median     sd
##    <dbl>  <dbl>  <dbl>
## 1 19329.  13218 19361.
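Since the same three statistics are computed for each element, they could also be produced in one pass by grouping on Element instead of filtering three times. This is a sketch on made-up numbers; toy and by_element are hypothetical names.

```r
library(dplyr)

# Made-up values for the three elements, with one NA
toy <- tibble(
  Element = c("Milk Animals", "Milk Animals",
              "Production", "Production", "Production",
              "Yield", "Yield"),
  Value   = c(100, 300, 1000, 3000, NA, 10, 30)
)

# One row of summary statistics per element
by_element <- toy %>%
  group_by(Element) %>%
  summarise(
    mean   = mean(Value, na.rm = TRUE),
    median = median(Value, na.rm = TRUE),
    sd     = sd(Value, na.rm = TRUE)
  )
```

The result has one row per element, which makes it easier to compare the three side by side.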

Quantiles for Milk Animals

I can see that the 75th percentile (1,546,533) is much further from the median (295,000) than the 25th percentile (21,300) is: a gap of about 1.25 million above the median versus about 274,000 below it. This tells me the data is skewed to the right.

cattleDairy %>%
  filter(Element == 'Milk Animals') %>%
  select(Value) %>% 
  summarise(
    quantile_25 = quantile(Value, 0.25, na.rm = TRUE),
    quantile_50 = quantile(Value, 0.5, na.rm = TRUE),
    quantile_75 = quantile(Value, 0.75, na.rm = TRUE))
## # A tibble: 1 × 3
##   quantile_25 quantile_50 quantile_75
##         <dbl>       <dbl>       <dbl>
## 1       21300      295000     1546533
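The skew check can also be made explicit by computing the gaps on each side of the median. This is a sketch on made-up values; the column names lower_gap and upper_gap are my own.

```r
library(dplyr)

# Made-up right-skewed values
toy <- tibble(Value = c(1, 2, 3, 10, 100))

toy_q <- toy %>%
  summarise(
    q25 = quantile(Value, 0.25, na.rm = TRUE, names = FALSE),
    q50 = quantile(Value, 0.50, na.rm = TRUE, names = FALSE),
    q75 = quantile(Value, 0.75, na.rm = TRUE, names = FALSE),
    lower_gap = q50 - q25,  # distance from the 25th percentile to the median
    upper_gap = q75 - q50   # distance from the median to the 75th percentile
  )
```

An upper_gap that is much larger than lower_gap points to a right skew, which is the pattern in the output above.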

Minimum and maximum values for Milk Animals

To see the largest and smallest values for milk animals, I will pipe in cattleDairy and filter by Milk Animals. I will then summarise with max and min and subtract the two values to get the range.

The range is massive. Digging further to understand why the range is so large is beyond the scope of this challenge.

cattleDairy %>% 
  filter(Element == 'Milk Animals') %>% 
  summarise(maxValue = max(Value, na.rm = TRUE), 
            minValue = min(Value, na.rm = TRUE), 
            rangeValue = maxValue - minValue)
## # A tibble: 1 × 3
##    maxValue minValue rangeValue
##       <dbl>    <dbl>      <dbl>
## 1 276573845        8  276573837
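If I did want to dig further, a first step could be to look at which rows hold the smallest and largest values. This is a sketch on made-up rows. The area names are invented, and the idea that a regional total might sit next to a very small area is only an assumption, not something I have checked in the file.

```r
library(dplyr)

# Made-up rows: a tiny area, a mid-sized one, and a hypothetical regional total
toy <- tibble(
  Area  = c("Smallland", "Midland", "Regional total"),
  Year  = c(1961, 1990, 2018),
  Value = c(8, 5000, 2000000)
)

# Pull the full rows for the minimum and the maximum, not just the numbers
extremes <- bind_rows(
  slice_min(toy, Value, n = 1),
  slice_max(toy, Value, n = 1)
)
```

Seeing the Area and Year next to each extreme would show whether the range reflects real differences or a mix of different kinds of rows.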