Code
library(here)
library(readr)
library(knitr)
library(dplyr)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)library(here)
library(readr)
library(knitr)
library(dplyr)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)Today’s challenge is to
read in a dataset, and
describe the dataset using both words and any supporting information (e.g., tables, etc)
Read in one (or more) of the following data sets, using the correct R package and command.
You should have already downloaded the datasets from Google Classroom and stored them in a common directory on your computer.
In this challenge, as in all subsequent challenges, the number of stars corresponds to the difficulty of the dataset. You are only required to do the challenge on one dataset, though you are welcome to do it with multiple datasets.
In general, I encourage you to “challenge” yourself by trying to work with a dataset above your experience.
railroad_2012_clean_county.csv ⭐birds.csv ⭐⭐FAOstat\*.csv ⭐⭐wild_bird_data.xlsx ⭐⭐⭐StateCounty2012.xls ⭐⭐⭐⭐Add any comments or documentation as needed. More challenging data sets may require additional code chunks and documentation.
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
The working directory for RStudio has been set such that “birds.csv” can be found at the root of the working directory using the setwd() method.
birds <- read_csv(here("birds.csv"))
birds# A tibble: 30,977 × 14
`Domain Code` Domain `Area Code` Area `Element Code` Element `Item Code`
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
1 QA Live Anim… 2 Afgh… 5112 Stocks 1057
2 QA Live Anim… 2 Afgh… 5112 Stocks 1057
3 QA Live Anim… 2 Afgh… 5112 Stocks 1057
4 QA Live Anim… 2 Afgh… 5112 Stocks 1057
5 QA Live Anim… 2 Afgh… 5112 Stocks 1057
6 QA Live Anim… 2 Afgh… 5112 Stocks 1057
7 QA Live Anim… 2 Afgh… 5112 Stocks 1057
8 QA Live Anim… 2 Afgh… 5112 Stocks 1057
9 QA Live Anim… 2 Afgh… 5112 Stocks 1057
10 QA Live Anim… 2 Afgh… 5112 Stocks 1057
# ℹ 30,967 more rows
# ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
# Value <dbl>, Flag <chr>, `Flag Description` <chr>
The data set comprises of 30,977 rows with 14 columns.
birds# A tibble: 30,977 × 14
`Domain Code` Domain `Area Code` Area `Element Code` Element `Item Code`
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
1 QA Live Anim… 2 Afgh… 5112 Stocks 1057
2 QA Live Anim… 2 Afgh… 5112 Stocks 1057
3 QA Live Anim… 2 Afgh… 5112 Stocks 1057
4 QA Live Anim… 2 Afgh… 5112 Stocks 1057
5 QA Live Anim… 2 Afgh… 5112 Stocks 1057
6 QA Live Anim… 2 Afgh… 5112 Stocks 1057
7 QA Live Anim… 2 Afgh… 5112 Stocks 1057
8 QA Live Anim… 2 Afgh… 5112 Stocks 1057
9 QA Live Anim… 2 Afgh… 5112 Stocks 1057
10 QA Live Anim… 2 Afgh… 5112 Stocks 1057
# ℹ 30,967 more rows
# ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
# Value <dbl>, Flag <chr>, `Flag Description` <chr>
The data set has a total of 8 <chr> type columns and the remaining 6 columns are of the <dbl> type. The birds variable contains the entire dataset.
The dataset seems to provide a count of live animals within areas (identified by an area code) measured over multiple years. The Flag and Flag Description columns highlight how the estimate was likely obtained.
We can create multiple summaries over the head count values comprising of the mean, median, variance and standard deviation grouped by Area, Item, Year and Flag.
Since the Value column contains “NA” values we remove these during each of the summaries below.
The following is the summary tibble for the metrics on all kinds of livestock when grouped by the Item.
birds %>%
group_by(Item) %>%
summarize(mean_count=mean(Value,na.rm=T),
median_count=median(Value,na.rm=T),
var_count=var(Value,na.rm=T),
sd_count=sd(Value,na.rm=T))# A tibble: 5 × 5
Item mean_count median_count var_count sd_count
<chr> <dbl> <dbl> <dbl> <dbl>
1 Chickens 207931. 10784. 1.17e12 1081629.
2 Ducks 23072. 510 1.22e10 110621.
3 Geese and guinea fowls 10292. 258 1.98e 9 44489.
4 Pigeons, other birds 6163. 2800 7.19e 7 8481.
5 Turkeys 15228. 528 3.18e 9 56416.
The following is the summary tibble for the metrics on all kinds of livestock when grouped by the Area.
birds %>%
group_by(Area) %>%
summarize(mean_count=mean(Value,na.rm=T),
median_count=median(Value,na.rm=T),
var_count=var(Value,na.rm=T),
sd_count=sd(Value,na.rm=T))# A tibble: 248 × 5
Area mean_count median_count var_count sd_count
<chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan 8099. 6700 7.95e 6 2819.
2 Africa 196561. 12910. 1.90e11 435740.
3 Albania 2278. 1300 5.14e 6 2268.
4 Algeria 17621. 42.5 1.51e 9 38830.
5 American Samoa 41.4 38 1.97e 2 14.0
6 Americas 856356. 66924. 2.37e12 1539316.
7 Angola 9453. 6075 7.97e 7 8928.
8 Antigua and Barbuda 93.6 85 1.57e 3 39.7
9 Argentina 18844. 2355 1.13e 9 33620.
10 Armenia 2062. 1528. 4.18e 6 2044.
# ℹ 238 more rows
This can also be further grouped by the “Item” column:
birds %>%
group_by(Area, Item) %>%
summarize(mean_count=mean(Value,na.rm=T),
median_count=median(Value,na.rm=T),
var_count=var(Value,na.rm=T),
sd_count=sd(Value,na.rm=T))# A tibble: 601 × 6
# Groups: Area [248]
Area Item mean_count median_count var_count sd_count
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan Chickens 8099. 6700 7.95e 6 2819.
2 Africa Chickens 936779. 865156. 2.65e11 515260.
3 Africa Ducks 13639. 12557 3.47e 7 5888.
4 Africa Geese and guinea fowls 12164. 8192. 6.49e 7 8058.
5 Africa Pigeons, other birds 11222. 9946. 7.50e 7 8659.
6 Africa Turkeys 9004. 5496 7.28e 7 8534.
7 Albania Chickens 4055. 3820. 4.28e 6 2069.
8 Albania Ducks 558. 410. 8.77e 4 296.
9 Albania Geese and guinea fowls 396. 278. 5.40e 4 232.
10 Albania Turkeys 750. 674 8.47e 4 291.
# ℹ 591 more rows
The following is the summary tibble for the metrics on all kinds of livestock when grouped by the Year.
birds %>%
group_by(Year) %>%
summarize(mean_count=mean(Value,na.rm=T),
median_count=median(Value,na.rm=T),
var_count=var(Value,na.rm=T),
sd_count=sd(Value,na.rm=T))# A tibble: 58 × 5
Year mean_count median_count var_count sd_count
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1961 36752. 1033 47059099105. 216931.
2 1962 37787. 1014 50595809286. 224935.
3 1963 38736. 1106 53353928347. 230985.
4 1964 39325. 1103 54806462166. 234108.
5 1965 40334. 1104 57858203063. 240537.
6 1966 41229. 1088. 60307571742. 245576.
7 1967 43240. 1193 66353492122. 257592.
8 1968 44420. 1252. 70623262796. 265750.
9 1969 45607. 1267 75005569765. 273871.
10 1970 47706. 1259 81653504019. 285751.
# ℹ 48 more rows
Similarly, a grouping by Area, Item and Year can also be created:
birds %>%
group_by(Area, Item, Year) %>%
summarize(mean_count=mean(Value,na.rm=T),
median_count=median(Value,na.rm=T),
var_count=var(Value,na.rm=T),
sd_count=sd(Value,na.rm=T))# A tibble: 30,977 × 7
# Groups: Area, Item [601]
Area Item Year mean_count median_count var_count sd_count
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan Chickens 1961 4700 4700 NA NA
2 Afghanistan Chickens 1962 4900 4900 NA NA
3 Afghanistan Chickens 1963 5000 5000 NA NA
4 Afghanistan Chickens 1964 5300 5300 NA NA
5 Afghanistan Chickens 1965 5500 5500 NA NA
6 Afghanistan Chickens 1966 5800 5800 NA NA
7 Afghanistan Chickens 1967 6600 6600 NA NA
8 Afghanistan Chickens 1968 6290 6290 NA NA
9 Afghanistan Chickens 1969 6300 6300 NA NA
10 Afghanistan Chickens 1970 6000 6000 NA NA
# ℹ 30,967 more rows
An observation from the above summary is that this is equivalent to the non-grouped data since each area has a single item count value per year.
The following is the summary tibble for the metrics on all kinds of livestock when grouped by the Year.
birds %>%
group_by(Flag, `Flag Description`) %>%
summarize(mean_count=mean(Value,na.rm=T),
median_count=median(Value,na.rm=T),
var_count=var(Value,na.rm=T),
sd_count=sd(Value,na.rm=T))# A tibble: 6 × 6
# Groups: Flag [6]
Flag `Flag Description` mean_count median_count var_count sd_count
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 * Unofficial figure 116743. 1952 2.56e11 506093.
2 A Aggregate, may include offic… 345885. 8003 2.21e12 1485598.
3 F FAO estimate 16578. 465 1.36e10 116605.
4 Im FAO data based on imputation… 47559. 501 1.06e11 324915.
5 M Data not available NaN NA NA NA
6 <NA> Official data 32128. 2500 1.35e10 116368.
The NA flag denotes estimates collected through “Official” sources, whereas the * flag denotes estimates collected through “Unofficial” sources.
The dataset values are a unit of “1000 Heads”. The Value column can be multiplied by 1000 to get the actual values and a summary tibble can be created for the mutatated dataset.
birds_actuals <- birds %>% mutate(Actual_Value=Value*1000)The mutated dataset with the Actual_Value column is stored in the birds_actuals variable to be used in the subsequent summaries.
birds_actuals %>%
group_by(Area) %>%
summarize(mean_count=mean(Actual_Value,na.rm=T),
median_count=median(Actual_Value,na.rm=T),
var_count=var(Actual_Value,na.rm=T),
sd_count=sd(Actual_Value,na.rm=T))# A tibble: 248 × 5
Area mean_count median_count var_count sd_count
<chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan 8098741. 6700000 7.95e12 2819040.
2 Africa 196561452. 12909500 1.90e17 435739643.
3 Albania 2278429. 1300000 5.14e12 2267911.
4 Algeria 17621159. 42500 1.51e15 38829507.
5 American Samoa 41414. 38000 1.97e 8 14039.
6 Americas 856355944. 66923500 2.37e18 1539315549.
7 Angola 9452759. 6075000 7.97e13 8927939.
8 Antigua and Barbuda 93638. 85000 1.57e 9 39661.
9 Argentina 18844336. 2355000 1.13e15 33620317.
10 Armenia 2061556. 1527500 4.18e12 2043773.
# ℹ 238 more rows
birds_actuals %>%
group_by(Year) %>%
summarize(mean_count=mean(Actual_Value,na.rm=T),
median_count=median(Actual_Value,na.rm=T),
var_count=var(Actual_Value,na.rm=T),
sd_count=sd(Actual_Value,na.rm=T))# A tibble: 58 × 5
Year mean_count median_count var_count sd_count
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1961 36751736. 1033000 4.71e16 216931093.
2 1962 37786802. 1014000 5.06e16 224935122.
3 1963 38735652. 1106000 5.34e16 230984693.
4 1964 39325340. 1103000 5.48e16 234107800.
5 1965 40333935. 1104000 5.79e16 240537322.
6 1966 41229319. 1087500 6.03e16 245576000.
7 1967 43240246. 1193000 6.64e16 257591716.
8 1968 44419737. 1252500 7.06e16 265750377.
9 1969 45607031. 1267000 7.50e16 273871448.
10 1970 47706244. 1259000 8.17e16 285750773.
# ℹ 48 more rows
birds_actuals %>%
group_by(Item) %>%
summarize(mean_count=mean(Actual_Value,na.rm=T),
median_count=median(Actual_Value,na.rm=T),
var_count=var(Actual_Value,na.rm=T),
sd_count=sd(Actual_Value,na.rm=T))# A tibble: 5 × 5
Item mean_count median_count var_count sd_count
<chr> <dbl> <dbl> <dbl> <dbl>
1 Chickens 207930808. 10783500 1.17e18 1081628940.
2 Ducks 23071673. 510000 1.22e16 110620641.
3 Geese and guinea fowls 10291937. 258000 1.98e15 44489477.
4 Pigeons, other birds 6163375. 2800000 7.19e13 8480750.
5 Turkeys 15227919. 528000 3.18e15 56415943.
birds_actuals %>%
group_by(Flag, `Flag Description`) %>%
summarize(mean_count=mean(Actual_Value,na.rm=T),
median_count=median(Actual_Value,na.rm=T),
var_count=var(Actual_Value,na.rm=T),
sd_count=sd(Actual_Value,na.rm=T))# A tibble: 6 × 6
# Groups: Flag [6]
Flag `Flag Description` mean_count median_count var_count sd_count
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 * Unofficial figure 116743172. 1952000 2.56e17 5.06e8
2 A Aggregate, may include offic… 345884752. 8003000 2.21e18 1.49e9
3 F FAO estimate 16578255. 465000 1.36e16 1.17e8
4 Im FAO data based on imputation… 47559406. 501000 1.06e17 3.25e8
5 M Data not available NaN NA NA NA
6 <NA> Official data 32127637. 2500000 1.35e16 1.16e8