Some screen scraping to automatically download all of the CSV files. The idea is that we can re-use the code for a different week by simply changing the current variable.
library(tidyverse)
library(purrr)
library(fs)
current <- "2019/2019-01-29"
tidytuesday_path <- "https://github.com/rfordatascience/tidytuesday/tree/master/data/"
tidytuesday_raw <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/"
Using rvest and xml2 to read the GitHub page and extract all of the tables from the HTML code.
library(rvest)
git_folder <- read_html(paste0(tidytuesday_path, current))
tables <- html_table(git_folder)
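To see what html_table() hands back, here is a minimal sketch that runs on a hand-written HTML fragment instead of the live GitHub page. The fragment, its file names, and the demo_tables variable are made up for illustration; only the Name column header mirrors what the download code relies on.

```r
library(rvest)

# Hypothetical stand-in for the GitHub folder listing
page <- minimal_html("
  <table>
    <tr><th>Name</th><th>Size</th></tr>
    <tr><td>example_a.csv</td><td>1 KB</td></tr>
    <tr><td>readme.md</td><td>2 KB</td></tr>
  </table>
")

# html_table() on a whole document returns a list,
# with one data frame per <table> element
demo_tables <- html_table(page)

length(demo_tables)
demo_tables[[1]]$Name
```

Because the result is a list of data frames, it can be passed straight to walk() or map().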
Two uses of walk(): the outer one extracts only the files with a CSV extension from the “Name” column of each table (now a data frame), and the inner one downloads each file and saves it into a folder called data.
if(!dir_exists("data")) dir_create("data")
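# Outer walk(): one pass per table found on the page
walk(
  tables,
  ~{
    # Keep only names that end in a literal ".csv"
    cv <- str_detect(.x$Name, "\\.csv$")
    fl <- .x$Name[cv]
    # Inner walk(): download each file that is not already in data/
    walk(
      fl,
      ~if(!file_exists(path("data", .x)))
        download.file(
          paste0(tidytuesday_raw, current, "/", .x),
          path("data", .x)
        )
    )
  }
)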
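One detail worth a quick check when filtering by extension: in a regular expression an unescaped dot matches any character. The sketch below uses hypothetical file names (not from the TidyTuesday folder) to compare a loose pattern with an escaped, anchored one.

```r
library(stringr)

# Hypothetical file names, chosen to show the difference
files <- c("milk.csv", "notes_csv_draft.txt", "readme.md")

# Unescaped: "." is a wildcard, so "_csv" inside a name also matches
loose <- str_detect(files, ".csv")

# Escaped and anchored: only names that end in a literal ".csv"
strict <- str_detect(files, "\\.csv$")

files[loose]
files[strict]
```

With these names the loose pattern keeps the .txt file as well, while the anchored pattern keeps only milk.csv.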
I recently discovered dir_map() from fs, which works like purrr::map() but iterates over the files in a directory. Using that and some rlang functions, we can easily load all of the files into our R environment.
library(rlang)
# Reads the file names and removes unnecessary
# parts for clean variable names
csv_names <- dir_map("data", as.character) %>%
str_remove("^data/") %>%
str_remove("\\.csv$")
# Reads the files into memory and names
# each list item
csv_files <- dir_map("data", read_csv) %>%
set_names(csv_names)
# To load each data frame in the list into
# its own variable in our Global Environment
env_bind(global_env(), !!! csv_files)
ls()
[1] "clean_cheese" "csv_files" "csv_names" "current"
[5] "fluid_milk_sales" "git_folder" "milk_products_facts" "milkcow_facts"
[9] "state_milk_production" "tables" "thm" "tidytuesday_path"
[13] "tidytuesday_raw"
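The splicing step can be tried in isolation. This is a small sketch, assuming a made-up named list (demo_list) in place of csv_files and a fresh environment (demo_env) in place of the Global Environment, so nothing in the session gets overwritten.

```r
library(rlang)

# Hypothetical named list standing in for csv_files
demo_list <- list(
  first_table  = data.frame(x = 1:3),
  second_table = data.frame(y = letters[1:2])
)

# A fresh environment, so the sketch does not touch the Global Environment
demo_env <- env()

# !!! splices the list so each element becomes its own name = value argument
env_bind(demo_env, !!!demo_list)

sort(env_names(demo_env))
nrow(env_get(demo_env, "first_table"))
```

Each list name becomes a variable name, which is why cleaning the file names first matters.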
The skimr package is a great way to quickly get an idea of the contents of the data. Using map() and rlang’s env_get(), the csv_names character vector can be used to automatically pull each of the newly created data frames from the Global Environment and run it through the skim() function.
library(skimr)
map(
csv_names,
~skim(env_get(global_env(), nm = .x))
)
[[1]]
Skim summary statistics
n obs: 48
n variables: 17
-- Variable type:integer -------------------------------------------------------
variable missing complete n mean sd p0 p25 p50 p75 p100 hist
Year 0 48 48 1993.5 14 1970 1981.75 1993.5 2005.25 2017 ▇▇▇▇▇▇▇▇
-- Variable type:numeric -------------------------------------------------------
variable missing complete n mean sd p0 p25 p50 p75 p100 hist
American Other 0 48 48 2.62 0.65 1.2 2.22 2.58 2.97 3.99 ▁▂▅▇▇▃▁▃
Blue 12 36 48 0.2 0.061 0.15 0.16 0.17 0.18 0.32 ▇▂▁▁▁▁▁▃
Brick 0 48 48 0.054 0.031 0.01 0.03 0.05 0.073 0.12 ▅▇▃▃▃▃▂▂
Cheddar 0 48 48 8.89 1.52 5.79 8.3 9.52 9.92 11.07 ▃▂▁▁▃▆▇▂
Cream and Neufchatel 0 48 48 1.74 0.69 0.61 1.1 2.05 2.35 2.64 ▆▃▃▂▂▃▇▇
Foods and spreads 0 48 48 3.16 0.44 1.91 2.99 3.22 3.44 3.98 ▁▁▂▂▆▇▂▂
Italian other 0 48 48 2.15 0.73 0.87 1.52 2.18 2.75 3.49 ▅▅▅▆▇▃▇▂
Mozzarella 0 48 48 6.86 3.43 1.19 3.18 7.7 9.97 11.73 ▇▅▂▃▅▅▇▇
Muenster 0 48 48 0.34 0.09 0.17 0.28 0.34 0.4 0.53 ▂▆▇▆▆▃▃▂
Other Dairy Cheese 0 48 48 1.02 0.36 0.41 0.75 0.97 1.29 1.59 ▅▅▅▅▂▇▂▇
Processed Cheese 0 48 48 4.33 0.63 3.31 3.82 4.42 4.8 5.44 ▆▆▆▂▆▇▂▅
Swiss 0 48 48 1.16 0.11 0.88 1.07 1.17 1.24 1.35 ▁▂▅▇▃▇▇▃
Total American Chese 0 48 48 11.51 1.95 7 10.81 11.83 12.84 15.06 ▂▂▂▂▇▇▅▁
Total Italian Cheese 0 48 48 9.01 4.15 2.05 4.68 9.95 12.73 15.21 ▇▇▃▃▆▇▇▇
Total Natural Cheese 0 48 48 25.35 7.37 11.37 19.47 26.29 31.78 37.23 ▅▃▂▅▆▆▇▃
Total Processed Cheese Products 0 48 48 7.49 0.87 5.53 6.9 7.59 8.2 8.75 ▂▂▅▇▂▅▇▇
[[2]]
Skim summary statistics
n obs: 387
n variables: 3
-- Variable type:character -----------------------------------------------------
variable missing complete n min max empty n_unique
milk_type 0 387 387 4 21 0 9
-- Variable type:numeric -------------------------------------------------------
variable missing complete n mean sd p0 p25 p50 p75 p100 hist
pounds 0 387 387 1.2e+10 1.7e+10 7.6e+07 8.4e+08 3.9e+09 1.7e+10 5.6e+10 ▇▁▂▁▁▁▁▂
year 0 387 387 1996 12.43 1975 1985 1996 2007 2017 ▇▇▇▇▇▇▇▇
[[3]]
Skim summary statistics
n obs: 35
n variables: 11
-- Variable type:integer -------------------------------------------------------
variable missing complete n mean sd p0 p25 p50 p75 p100 hist
milk_cow_cost_per_animal 0 35 35 1283 294.14 820 1100 1190 1425 1950 ▃▅▇▅▃▂▂▂
milk_per_cow 0 35 35 16962.46 3210.17 11891 14254 16871 19722.5 22259 ▇▅▅▅▅▅▅▇
-- Variable type:numeric -------------------------------------------------------
variable missing complete n mean sd p0 p25 p50 p75 p100 hist
alfalfa_hay_price 0 35 35 104.59 39.13 64.64 79.22 94.02 109.2 206.08 ▇▇▅▂▁▁▁▂
avg_milk_cow_number 0 35 35 9695742.86 7e+05 9e+06 9171000 9314000 1e+07 1.1e+07 ▇▂▁▁▁▁▂▂
avg_price_milk 0 35 35 0.15 0.028 0.12 0.13 0.14 0.15 0.24 ▇▃▂▁▂▁▁▁
dairy_ration 0 35 35 0.058 0.022 0.034 0.046 0.049 0.059 0.12 ▅▇▁▂▁▁▁▁
milk_feed_price_ratio 0 35 35 2.7 0.5 1.52 2.54 2.7 3.03 3.64 ▂▁▁▃▇▂▁▂
milk_production_lbs 0 35 35 1.6e+11 2.2e+10 1.3e+11 1.4e+11 1.6e+11 1.8e+11 2.1e+11 ▃▇▇▂▅▂▃▃
milk_volume_to_buy_cow_in_lbs 0 35 35 8848.27 1740.67 6560 7573.53 8625.95 9697.41 13410.85 ▇▇▇▃▃▂▁▂
slaughter_cow_price 0 35 35 0.49 0.14 0.33 0.4 0.45 0.51 1.02 ▇▇▃▁▁▁▁▁
year 0 35 35 1997 10.25 1980 1988.5 1997 2005.5 2014 ▇▆▆▇▆▆▆▇
[[4]]
Skim summary statistics
n obs: 43
n variables: 18
-- Variable type:integer -------------------------------------------------------
variable missing complete n mean sd p0 p25 p50 p75 p100 hist
fluid_milk 0 43 43 202.91 27.03 149 183 205 223.5 247 ▃▂▅▆▃▂▇▃
year 0 43 43 1996 12.56 1975 1985.5 1996 2006.5 2017 ▇▇▇▇▇▇▇▇
-- Variable type:numeric -------------------------------------------------------
variable missing complete n mean sd p0 p25 p50 p75 p100 hist
butter 0 43 43 4.71 0.43 4.19 4.37 4.54 4.91 5.69 ▇▇▅▃▂▁▃▂
cheese_american 0 43 43 11.95 1.5 8.15 11.28 12.12 12.95 15.06 ▁▂▁▅▆▇▂▁
cheese_cottage 0 43 43 3.13 0.86 2.07 2.56 2.65 4.03 4.63 ▆▇▁▂▁▁▃▃
cheese_other 0 43 43 14.71 4.82 6.13 10.68 15.26 18.96 22.05 ▆▂▂▃▆▅▇▅
dry_buttermilk 0 43 43 0.23 0.054 0.17 0.2 0.2 0.25 0.39 ▅▇▂▁▁▁▁▁
dry_nonfat_milk 0 43 43 3.02 0.53 2.12 2.62 3.05 3.31 4.28 ▃▆▅▇▇▂▁▂
dry_whey 0 43 43 3.05 0.66 1.89 2.4 3.02 3.65 4.09 ▃▇▃▅▃▆▇▅
dry_whole_milk 0 43 43 0.31 0.14 0.095 0.2 0.3 0.4 0.6 ▃▇▂▂▇▁▃▁
evap_cnd_bulk_and_can_skim_milk 0 43 43 4.32 0.82 3.02 3.64 4.24 5.17 5.58 ▇▃▇▆▂▂▇▇
evap_cnd_bulk_whole_milk 0 43 43 0.81 0.29 0.44 0.58 0.7 1.06 1.46 ▇▇▂▃▃▃▂▂
evap_cnd_canned_whole_milk 0 43 43 2.04 0.71 0.94 1.49 1.84 2.34 3.95 ▁▇▃▃▁▂▁▁
fluid_yogurt 0 43 43 7.16 4.34 1.97 3.78 5.87 11.31 14.93 ▇▇▆▂▂▂▂▆
frozen_ice_cream_reduced_fat 0 43 43 6.4 0.43 5.67 6.08 6.33 6.61 7.55 ▂▇▅▇▂▂▁▁
frozen_ice_cream_regular 0 43 43 15.63 1.65 12.47 14.69 15.71 17.06 18.21 ▅▁▂▅▇▂▆▆
frozen_other 0 43 43 3.13 1.37 1.35 2.27 2.91 3.76 6.54 ▅▂▇▂▂▁▂▁
frozen_sherbet 0 43 43 1.14 0.15 0.8 1.11 1.18 1.22 1.36 ▂▂▁▁▃▇▃▂
[[5]]
Skim summary statistics
n obs: 2400
n variables: 4
-- Variable type:character -----------------------------------------------------
variable missing complete n min max empty n_unique
region 0 2400 2400 7 15 0 10
state 0 2400 2400 4 14 0 50
-- Variable type:integer -------------------------------------------------------
variable missing complete n mean sd p0 p25 p50 p75 p100 hist
year 50 2350 2400 1993.36 13.97 1970 1981 1993 2006 2017 ▇▇▇▇▇▇▇▇
-- Variable type:numeric -------------------------------------------------------
variable missing complete n mean sd p0 p25 p50 p75 p100 hist
milk_produced 0 2400 2400 3.1e+09 5.4e+09 3e+06 4.6e+08 1.3e+09 2.7e+09 4.2e+10 ▇▁▁▁▁▁▁▁
Selected the fluid_milk_sales data to create some visualizations. Using dplyr and ggplot2 to get an idea of what sales over time look like by type of milk.
fluid_milk_sales %>%
filter(milk_type != "Total Production") %>%
ggplot() +
geom_line(aes(year, pounds / 1e+9, group = milk_type, color = milk_type)) +
scale_color_manual(values = c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")) +
labs(title = "Annual Milk sales by type", y = "Pounds (Billion)", x = "Year")
Isolating just the Whole Milk sales to see what the actual drop in sales has been from its peak to its lowest valley.
whole <- fluid_milk_sales %>%
filter(milk_type == "Whole")
pl <- whole %>%
summarise(
max_pounds = max(pounds) / 1e+9,
min_pounds = min(pounds) / 1e+9,
# Despite its name, max_year holds the year in which sales hit their minimum
max_year = sum(ifelse(pounds == min(pounds), year, 0))
) %>%
mutate(diff_pounds = max_pounds-min_pounds)
whole %>%
ggplot() +
geom_area(aes(year, pounds / 1e+9), alpha = 0.5, fill = "#56B4E9") +
labs(title = "Annual Whole Milk sales", y = "Pounds (Billion)", x = "Year") +
geom_errorbar(aes(
x = pl$max_year, ymin = pl$min_pounds,
ymax = pl$max_pounds)) +
geom_text(aes(
x = pl$max_year,
y = pl$diff_pounds/2,
label= paste0(round(pl$diff_pounds, 1), "B drop")
))
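The peak-to-valley summary can be checked on a few made-up numbers. The sketch below assumes a hypothetical toy data frame (the figures are not from the milk data) and uses which.min() to look up the year of the lowest value, which is one alternative to the sum(ifelse()) approach.

```r
library(dplyr)

# Hypothetical sales figures, in billions of pounds
toy <- tibble(
  year   = 2000:2004,
  pounds = c(30, 34, 28, 21, 23)
)

toy_summary <- toy %>%
  summarise(
    max_pounds  = max(pounds),
    min_pounds  = min(pounds),
    # which.min() returns the row index of the lowest value
    min_year    = year[which.min(pounds)],
    diff_pounds = max_pounds - min_pounds
  )

toy_summary
```

Here the peak is 34, the valley is 21 in 2003, and the drop is 13.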
The next visualization shows that the largest year-over-year drops in Whole Milk sales occurred during the mid to late 1980s.
fluid_milk_sales %>%
filter(milk_type == "Whole") %>%
mutate(diff_pounds = round((pounds / lag(pounds) - 1) * 100)) %>%
ggplot() +
geom_col(aes(year, diff_pounds), alpha = 0.5, fill = "#56B4E9") +
geom_text(aes(year, -0.4, label = diff_pounds), size = 3) +
labs(title = "Annual Whole Milk sales", y = "Year-over-year % difference", x = "Year") +
theme(
axis.text.y = element_blank()
)
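For reference, one common way to define a year-over-year percentage change is to divide each value by the previous year's value and subtract one. The sketch below applies that definition to a hypothetical toy_sales data frame; the numbers are invented so the result is easy to verify by hand.

```r
library(dplyr)

# Hypothetical sales: two 10% drops followed by a 10% rise
toy_sales <- tibble(
  year   = 2000:2003,
  pounds = c(100, 90, 81, 89.1)
)

toy_sales <- toy_sales %>%
  # lag() shifts the series by one row, so the first year has no prior value (NA)
  mutate(pct_change = round((pounds / lag(pounds) - 1) * 100))

toy_sales$pct_change
```

The first year comes back as NA because there is no earlier value to compare against.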