Challenge 1 Instructions

challenge_1
railroads
faostat
wildbirds
Reading in data and creating a post
Author

Sean Conway

Published

December 15, 2024

Code
library(here)
library(readr)
library(knitr)
library(dplyr)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a dataset, and

  2. describe the dataset using both words and any supporting information (e.g., tables, etc)

Read in the Data

Read in one (or more) of the following data sets, using the correct R package and command.

You should have already downloaded the datasets from Google Classroom and stored them in a common directory on your computer.

In this challenge, as in all subsequent challenges, the number of stars corresponds to the difficulty of the dataset. You are only required to do the challenge on one dataset, though you are welcome to do it with multiple datasets.

In general, I encourage you to “challenge” yourself by trying to work with a dataset above your experience.

  • railroad_2012_clean_county.csv
  • birds.csv ⭐⭐
  • FAOstat\*.csv ⭐⭐
  • wild_bird_data.xlsx ⭐⭐⭐
  • StateCounty2012.xls ⭐⭐⭐⭐

Add any comments or documentation as needed. More challenging data sets may require additional code chunks and documentation.

Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

Solutions

Reading the Data

The working directory for RStudio has been set such that “birds.csv” can be found at the root of the working directory using the setwd() method.

Code
birds <- read_csv(here("birds.csv"))
birds
# A tibble: 30,977 × 14
   `Domain Code` Domain     `Area Code` Area  `Element Code` Element `Item Code`
   <chr>         <chr>            <dbl> <chr>          <dbl> <chr>         <dbl>
 1 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 2 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 3 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 4 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 5 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 6 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 7 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 8 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 9 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
10 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
# ℹ 30,967 more rows
# ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
#   Value <dbl>, Flag <chr>, `Flag Description` <chr>

Data Description

High Level Description

The data set comprises of 30,977 rows with 14 columns.

Code
birds
# A tibble: 30,977 × 14
   `Domain Code` Domain     `Area Code` Area  `Element Code` Element `Item Code`
   <chr>         <chr>            <dbl> <chr>          <dbl> <chr>         <dbl>
 1 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 2 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 3 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 4 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 5 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 6 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 7 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 8 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 9 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
10 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
# ℹ 30,967 more rows
# ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
#   Value <dbl>, Flag <chr>, `Flag Description` <chr>

The data set has a total of 8 <chr> type columns and the remaining 6 columns are of the <dbl> type. The birds variable contains the entire dataset.

How was the Data likely collected?

The dataset seems to provide a count of live animals within areas (identified by an area code) measured over multiple years. The Flag and Flag Description columns highlight how the estimate was likely obtained.

Data Summaries

We can create multiple summaries over the head count values comprising of the mean, median, variance and standard deviation grouped by Area, Item, Year and Flag.

Since the Value column contains “NA” values we remove these during each of the summaries below.

Grouped by Item

The following is the summary tibble for the metrics on all kinds of livestock when grouped by the Item.

Code
birds %>% 
  group_by(Item) %>%
  summarize(mean_count=mean(Value,na.rm=T),
            median_count=median(Value,na.rm=T),
            var_count=var(Value,na.rm=T),
            sd_count=sd(Value,na.rm=T))
# A tibble: 5 × 5
  Item                   mean_count median_count var_count sd_count
  <chr>                       <dbl>        <dbl>     <dbl>    <dbl>
1 Chickens                  207931.       10784.   1.17e12 1081629.
2 Ducks                      23072.         510    1.22e10  110621.
3 Geese and guinea fowls     10292.         258    1.98e 9   44489.
4 Pigeons, other birds        6163.        2800    7.19e 7    8481.
5 Turkeys                    15228.         528    3.18e 9   56416.

Grouped by Area

The following is the summary tibble for the metrics on all kinds of livestock when grouped by the Area.

Code
birds %>% 
  group_by(Area) %>%
  summarize(mean_count=mean(Value,na.rm=T),
            median_count=median(Value,na.rm=T),
            var_count=var(Value,na.rm=T),
            sd_count=sd(Value,na.rm=T))
# A tibble: 248 × 5
   Area                mean_count median_count var_count  sd_count
   <chr>                    <dbl>        <dbl>     <dbl>     <dbl>
 1 Afghanistan             8099.        6700     7.95e 6    2819. 
 2 Africa                196561.       12910.    1.90e11  435740. 
 3 Albania                 2278.        1300     5.14e 6    2268. 
 4 Algeria                17621.          42.5   1.51e 9   38830. 
 5 American Samoa            41.4         38     1.97e 2      14.0
 6 Americas              856356.       66924.    2.37e12 1539316. 
 7 Angola                  9453.        6075     7.97e 7    8928. 
 8 Antigua and Barbuda       93.6         85     1.57e 3      39.7
 9 Argentina              18844.        2355     1.13e 9   33620. 
10 Armenia                 2062.        1528.    4.18e 6    2044. 
# ℹ 238 more rows

This can also be further grouped by the “Item” column:

Code
birds %>% 
  group_by(Area, Item) %>%
  summarize(mean_count=mean(Value,na.rm=T),
            median_count=median(Value,na.rm=T),
            var_count=var(Value,na.rm=T),
            sd_count=sd(Value,na.rm=T))
# A tibble: 601 × 6
# Groups:   Area [248]
   Area        Item                   mean_count median_count var_count sd_count
   <chr>       <chr>                       <dbl>        <dbl>     <dbl>    <dbl>
 1 Afghanistan Chickens                    8099.        6700    7.95e 6    2819.
 2 Africa      Chickens                  936779.      865156.   2.65e11  515260.
 3 Africa      Ducks                      13639.       12557    3.47e 7    5888.
 4 Africa      Geese and guinea fowls     12164.        8192.   6.49e 7    8058.
 5 Africa      Pigeons, other birds       11222.        9946.   7.50e 7    8659.
 6 Africa      Turkeys                     9004.        5496    7.28e 7    8534.
 7 Albania     Chickens                    4055.        3820.   4.28e 6    2069.
 8 Albania     Ducks                        558.         410.   8.77e 4     296.
 9 Albania     Geese and guinea fowls       396.         278.   5.40e 4     232.
10 Albania     Turkeys                      750.         674    8.47e 4     291.
# ℹ 591 more rows

Grouped by Year

The following is the summary tibble for the metrics on all kinds of livestock when grouped by the Year.

Code
birds %>% 
  group_by(Year) %>%
  summarize(mean_count=mean(Value,na.rm=T),
            median_count=median(Value,na.rm=T),
            var_count=var(Value,na.rm=T),
            sd_count=sd(Value,na.rm=T))
# A tibble: 58 × 5
    Year mean_count median_count    var_count sd_count
   <dbl>      <dbl>        <dbl>        <dbl>    <dbl>
 1  1961     36752.        1033  47059099105.  216931.
 2  1962     37787.        1014  50595809286.  224935.
 3  1963     38736.        1106  53353928347.  230985.
 4  1964     39325.        1103  54806462166.  234108.
 5  1965     40334.        1104  57858203063.  240537.
 6  1966     41229.        1088. 60307571742.  245576.
 7  1967     43240.        1193  66353492122.  257592.
 8  1968     44420.        1252. 70623262796.  265750.
 9  1969     45607.        1267  75005569765.  273871.
10  1970     47706.        1259  81653504019.  285751.
# ℹ 48 more rows

Similarly, a grouping by Area, Item and Year can also be created:

Code
birds %>% 
  group_by(Area, Item, Year) %>%
  summarize(mean_count=mean(Value,na.rm=T),
            median_count=median(Value,na.rm=T),
            var_count=var(Value,na.rm=T),
            sd_count=sd(Value,na.rm=T))
# A tibble: 30,977 × 7
# Groups:   Area, Item [601]
   Area        Item      Year mean_count median_count var_count sd_count
   <chr>       <chr>    <dbl>      <dbl>        <dbl>     <dbl>    <dbl>
 1 Afghanistan Chickens  1961       4700         4700        NA       NA
 2 Afghanistan Chickens  1962       4900         4900        NA       NA
 3 Afghanistan Chickens  1963       5000         5000        NA       NA
 4 Afghanistan Chickens  1964       5300         5300        NA       NA
 5 Afghanistan Chickens  1965       5500         5500        NA       NA
 6 Afghanistan Chickens  1966       5800         5800        NA       NA
 7 Afghanistan Chickens  1967       6600         6600        NA       NA
 8 Afghanistan Chickens  1968       6290         6290        NA       NA
 9 Afghanistan Chickens  1969       6300         6300        NA       NA
10 Afghanistan Chickens  1970       6000         6000        NA       NA
# ℹ 30,967 more rows

An observation from the above summary is that this is equivalent to the non-grouped data since each area has a single item count value per year.

Grouped by Flag

The following is the summary tibble for the metrics on all kinds of livestock when grouped by the Year.

Code
birds %>% 
  group_by(Flag, `Flag Description`) %>%
  summarize(mean_count=mean(Value,na.rm=T),
            median_count=median(Value,na.rm=T),
            var_count=var(Value,na.rm=T),
            sd_count=sd(Value,na.rm=T))
# A tibble: 6 × 6
# Groups:   Flag [6]
  Flag  `Flag Description`            mean_count median_count var_count sd_count
  <chr> <chr>                              <dbl>        <dbl>     <dbl>    <dbl>
1 *     Unofficial figure                116743.         1952   2.56e11  506093.
2 A     Aggregate, may include offic…    345885.         8003   2.21e12 1485598.
3 F     FAO estimate                      16578.          465   1.36e10  116605.
4 Im    FAO data based on imputation…     47559.          501   1.06e11  324915.
5 M     Data not available                  NaN            NA  NA            NA 
6 <NA>  Official data                     32128.         2500   1.35e10  116368.

The NA flag denotes estimates collected through “Official” sources, whereas the * flag denotes estimates collected through “Unofficial” sources.

Summarizing with actual values

The dataset values are a unit of “1000 Heads”. The Value column can be multiplied by 1000 to get the actual values and a summary tibble can be created for the mutatated dataset.

Code
birds_actuals <- birds %>% mutate(Actual_Value=Value*1000)

The mutated dataset with the Actual_Value column is stored in the birds_actuals variable to be used in the subsequent summaries.

Grouping by Area
Code
birds_actuals %>% 
  group_by(Area) %>%
  summarize(mean_count=mean(Actual_Value,na.rm=T),
            median_count=median(Actual_Value,na.rm=T),
            var_count=var(Actual_Value,na.rm=T),
            sd_count=sd(Actual_Value,na.rm=T))
# A tibble: 248 × 5
   Area                mean_count median_count var_count    sd_count
   <chr>                    <dbl>        <dbl>     <dbl>       <dbl>
 1 Afghanistan           8098741.      6700000   7.95e12    2819040.
 2 Africa              196561452.     12909500   1.90e17  435739643.
 3 Albania               2278429.      1300000   5.14e12    2267911.
 4 Algeria              17621159.        42500   1.51e15   38829507.
 5 American Samoa          41414.        38000   1.97e 8      14039.
 6 Americas            856355944.     66923500   2.37e18 1539315549.
 7 Angola                9452759.      6075000   7.97e13    8927939.
 8 Antigua and Barbuda     93638.        85000   1.57e 9      39661.
 9 Argentina            18844336.      2355000   1.13e15   33620317.
10 Armenia               2061556.      1527500   4.18e12    2043773.
# ℹ 238 more rows
Grouping by Year
Code
birds_actuals %>% 
  group_by(Year) %>%
  summarize(mean_count=mean(Actual_Value,na.rm=T),
            median_count=median(Actual_Value,na.rm=T),
            var_count=var(Actual_Value,na.rm=T),
            sd_count=sd(Actual_Value,na.rm=T))
# A tibble: 58 × 5
    Year mean_count median_count var_count   sd_count
   <dbl>      <dbl>        <dbl>     <dbl>      <dbl>
 1  1961  36751736.      1033000   4.71e16 216931093.
 2  1962  37786802.      1014000   5.06e16 224935122.
 3  1963  38735652.      1106000   5.34e16 230984693.
 4  1964  39325340.      1103000   5.48e16 234107800.
 5  1965  40333935.      1104000   5.79e16 240537322.
 6  1966  41229319.      1087500   6.03e16 245576000.
 7  1967  43240246.      1193000   6.64e16 257591716.
 8  1968  44419737.      1252500   7.06e16 265750377.
 9  1969  45607031.      1267000   7.50e16 273871448.
10  1970  47706244.      1259000   8.17e16 285750773.
# ℹ 48 more rows
Grouping by Item
Code
birds_actuals %>% 
  group_by(Item) %>%
  summarize(mean_count=mean(Actual_Value,na.rm=T),
            median_count=median(Actual_Value,na.rm=T),
            var_count=var(Actual_Value,na.rm=T),
            sd_count=sd(Actual_Value,na.rm=T))
# A tibble: 5 × 5
  Item                   mean_count median_count var_count    sd_count
  <chr>                       <dbl>        <dbl>     <dbl>       <dbl>
1 Chickens               207930808.     10783500   1.17e18 1081628940.
2 Ducks                   23071673.       510000   1.22e16  110620641.
3 Geese and guinea fowls  10291937.       258000   1.98e15   44489477.
4 Pigeons, other birds     6163375.      2800000   7.19e13    8480750.
5 Turkeys                 15227919.       528000   3.18e15   56415943.
Grouping by Flag
Code
birds_actuals %>% 
  group_by(Flag, `Flag Description`) %>%
  summarize(mean_count=mean(Actual_Value,na.rm=T),
            median_count=median(Actual_Value,na.rm=T),
            var_count=var(Actual_Value,na.rm=T),
            sd_count=sd(Actual_Value,na.rm=T))
# A tibble: 6 × 6
# Groups:   Flag [6]
  Flag  `Flag Description`            mean_count median_count var_count sd_count
  <chr> <chr>                              <dbl>        <dbl>     <dbl>    <dbl>
1 *     Unofficial figure             116743172.      1952000   2.56e17   5.06e8
2 A     Aggregate, may include offic… 345884752.      8003000   2.21e18   1.49e9
3 F     FAO estimate                   16578255.       465000   1.36e16   1.17e8
4 Im    FAO data based on imputation…  47559406.       501000   1.06e17   3.25e8
5 M     Data not available                  NaN            NA  NA        NA     
6 <NA>  Official data                  32127637.      2500000   1.35e16   1.16e8