Summary
The body of this paper is written as a short first-person skit set in a London newsroom in early 2001. I am a young data analyst/visualizer for the newspaper.
I am handed a seemingly simple assignment and it turns out to be trickier than I expected.
I start with a single zip file that contains 28 other zip files: 27 yearly archives plus one reference archive. Each yearly archive holds several files, and I only need one file from each.
During the course of this paper you will learn how to pull only the files you are interested in while also creating references to the source files.
In the end we will create a simple chart with the final data and save it to a pdf file.
Data sources
https://www.gov.uk/government/statistics/family-food-open-data
https://data.gov.uk/dataset/5c1a7a5d-4dd5-4b1b-84f2-3ba8883a07ca/family-food-open-data
https://webarchive.nationalarchives.gov.uk/20130103024837/http://www.defra.gov.uk/statistics/foodfarm/food/familyfood/nationalfoodsurvey/
Inspired by Hadley Wickham’s talk The Joy of Functional Programming
Load libraries and empty folders
library(tidyverse)
library(fs)
for (folder in c("nfs_zip_files", "nfs_source_files", "nfs_meals_out")) {
  # dir_delete() errors on a missing folder, so check before deleting
  if (dir_exists(folder)) dir_delete(folder)
  dir_create(folder)
}
Act 1
Setting: UK Newsroom in early 2001. I work in their infographics and visualization dept.
Aside: 2001 is like ancient history. I imagine that the news room looked something like this.
BOSS: O’Neil!! I’ve got a new assignment for you. The 2000 National Food Survey data just dropped and I need a chart of how the number of meals that people eat out per week has changed over time. The National Food Survey data is in the xyz folder. Get on it. Pronto!!
ME: Yes sir. Right away sir.
Aside: My Boss likes to call people by their last names and thinks this is a slam dunk. We supposedly have the data and he only needs a simple line graph. What could go wrong?
Act 2
Setting: Me toiling at my desk
What’s in the xyz folder
paths <- dir_ls("xyz/")
paths
## xyz/NFSopen_AllData.zip
Oh wonderful, a zip file.
Zip files are the “Mystery Box” of data. You never know what’s inside. It could be a well-formatted csv or it could be anything else.
What’s in the zip file
There is only one file, but it’s good to get into the practice of looking at the first element explicitly because it becomes part of a pattern later.
In the unzip() function, list = TRUE only lists the contents of the archive; it does not extract anything to a folder.
x <- paths[[1]]
unzip(x, list = TRUE)
## Name Length Date
## 1 NFS_1992.zip 2464072 2016-01-29 14:54:00
## 2 NFS_1993.zip 2750096 2016-01-29 15:03:00
## 3 NFS_1994.zip 2661643 2016-02-02 14:42:00
## 4 NFS_1995.zip 2765155 2016-02-02 14:42:00
## 5 NFS_1996.zip 3048139 2016-02-02 14:43:00
## 6 NFS_1997.zip 2374082 2016-02-02 14:43:00
## 7 NFS_1998.zip 2280058 2016-02-02 14:43:00
## 8 NFS_1999.zip 2275573 2016-02-02 14:44:00
## 9 NFS_2000.zip 2305816 2016-02-02 14:44:00
## 10 NFS_ReferenceCodesDescriptors.zip 3971143 2016-02-17 12:14:00
## 11 NFS_1974.zip 2463461 2016-01-28 19:01:00
## 12 NFS_1975.zip 2571030 2016-01-28 18:16:00
## 13 NFS_1976.zip 2603004 2016-01-28 19:10:00
## 14 NFS_1977.zip 2690012 2016-01-28 19:20:00
## 15 NFS_1978.zip 2596621 2016-01-28 19:28:00
## 16 NFS_1979.zip 2434235 2016-01-29 09:43:00
## 17 NFS_1980.zip 2678547 2016-01-29 10:00:00
## 18 NFS_1981.zip 2592301 2016-02-01 09:04:00
## 19 NFS_1982.zip 2649739 2016-01-29 10:52:00
## 20 NFS_1983.zip 2399506 2016-01-29 10:57:00
## 21 NFS_1984.zip 2322480 2016-01-29 11:33:00
## 22 NFS_1985.zip 2339160 2016-01-29 13:02:00
## 23 NFS_1986.zip 2320710 2016-01-29 13:10:00
## 24 NFS_1987.zip 2387542 2016-01-29 13:21:00
## 25 NFS_1988.zip 2378935 2016-01-29 13:28:00
## 26 NFS_1989.zip 2487599 2016-01-29 14:31:00
## 27 NFS_1990.zip 2233665 2016-01-29 14:37:00
## 28 NFS_1991.zip 2143935 2016-01-29 14:44:00
Ughh!!! More zip files. It looks like one for every year.
Useful trick: Only show file names
Most of the time you don’t want the extra columns such as Length and Date. unzip(list = TRUE) returns a data frame, so pull out its Name column to condense the output.
unzip(x, list = TRUE)$Name
## [1] "NFS_1992.zip" "NFS_1993.zip"
## [3] "NFS_1994.zip" "NFS_1995.zip"
## [5] "NFS_1996.zip" "NFS_1997.zip"
## [7] "NFS_1998.zip" "NFS_1999.zip"
## [9] "NFS_2000.zip" "NFS_ReferenceCodesDescriptors.zip"
## [11] "NFS_1974.zip" "NFS_1975.zip"
## [13] "NFS_1976.zip" "NFS_1977.zip"
## [15] "NFS_1978.zip" "NFS_1979.zip"
## [17] "NFS_1980.zip" "NFS_1981.zip"
## [19] "NFS_1982.zip" "NFS_1983.zip"
## [21] "NFS_1984.zip" "NFS_1985.zip"
## [23] "NFS_1986.zip" "NFS_1987.zip"
## [25] "NFS_1988.zip" "NFS_1989.zip"
## [27] "NFS_1990.zip" "NFS_1991.zip"
Unzip to another folder
Specify the destination folder with exdir =. The default for list = is FALSE so we won’t need to include that to unzip.
unzip(x, overwrite = TRUE, exdir = "nfs_zip_files/")
dir_ls("nfs_zip_files/")
## nfs_zip_files/NFS_1974.zip
## nfs_zip_files/NFS_1975.zip
## nfs_zip_files/NFS_1976.zip
## nfs_zip_files/NFS_1977.zip
## nfs_zip_files/NFS_1978.zip
## nfs_zip_files/NFS_1979.zip
## nfs_zip_files/NFS_1980.zip
## nfs_zip_files/NFS_1981.zip
## nfs_zip_files/NFS_1982.zip
## nfs_zip_files/NFS_1983.zip
## nfs_zip_files/NFS_1984.zip
## nfs_zip_files/NFS_1985.zip
## nfs_zip_files/NFS_1986.zip
## nfs_zip_files/NFS_1987.zip
## nfs_zip_files/NFS_1988.zip
## nfs_zip_files/NFS_1989.zip
## nfs_zip_files/NFS_1990.zip
## nfs_zip_files/NFS_1991.zip
## nfs_zip_files/NFS_1992.zip
## nfs_zip_files/NFS_1993.zip
## nfs_zip_files/NFS_1994.zip
## nfs_zip_files/NFS_1995.zip
## nfs_zip_files/NFS_1996.zip
## nfs_zip_files/NFS_1997.zip
## nfs_zip_files/NFS_1998.zip
## nfs_zip_files/NFS_1999.zip
## nfs_zip_files/NFS_2000.zip
## nfs_zip_files/NFS_ReferenceCodesDescriptors.zip
Drop the odd file from the paths that we’ll use going forward
There is one file that looks like a data dictionary. We might need that later and can exclude it now for the purpose of data processing.
paths <- dir_ls("nfs_zip_files/")
paths <- setdiff(paths, "nfs_zip_files/NFS_ReferenceCodesDescriptors.zip")
paths
## [1] "nfs_zip_files/NFS_1974.zip" "nfs_zip_files/NFS_1975.zip"
## [3] "nfs_zip_files/NFS_1976.zip" "nfs_zip_files/NFS_1977.zip"
## [5] "nfs_zip_files/NFS_1978.zip" "nfs_zip_files/NFS_1979.zip"
## [7] "nfs_zip_files/NFS_1980.zip" "nfs_zip_files/NFS_1981.zip"
## [9] "nfs_zip_files/NFS_1982.zip" "nfs_zip_files/NFS_1983.zip"
## [11] "nfs_zip_files/NFS_1984.zip" "nfs_zip_files/NFS_1985.zip"
## [13] "nfs_zip_files/NFS_1986.zip" "nfs_zip_files/NFS_1987.zip"
## [15] "nfs_zip_files/NFS_1988.zip" "nfs_zip_files/NFS_1989.zip"
## [17] "nfs_zip_files/NFS_1990.zip" "nfs_zip_files/NFS_1991.zip"
## [19] "nfs_zip_files/NFS_1992.zip" "nfs_zip_files/NFS_1993.zip"
## [21] "nfs_zip_files/NFS_1994.zip" "nfs_zip_files/NFS_1995.zip"
## [23] "nfs_zip_files/NFS_1996.zip" "nfs_zip_files/NFS_1997.zip"
## [25] "nfs_zip_files/NFS_1998.zip" "nfs_zip_files/NFS_1999.zip"
## [27] "nfs_zip_files/NFS_2000.zip"
Look inside the first zip file
Hoping for no more zip files. Fingers crossed.
x <- paths[[1]]
unzip(x, list = TRUE)$Name
## [1] "1974 household data.txt"
## [2] "1974 mealsout.txt"
## [3] "1974 nutrient conversion factors data.txt"
## [4] "1974 visitor data.txt"
## [5] "1974 diary data.txt"
## [6] "1974 household by pregnancy or under twos.txt"
Sigh of relief. This zip contains several txt files, which are probably the data, and one of them is called “mealsout”. We are getting closer.
How do I unzip all these files into a single folder?
I’m going to take a minute to explain the next part because it’s the magic in the process.
The map() functions in the purrr package allow you to do bulk processing.
The pattern is:
1. Take a function that you can apply to one element
2. Generalize it
3. Apply it to all the elements
If you pass the whole paths vector to unzip() it fails with the error below, because unzip() accepts only one zip file at a time. You have to send the file names to it one at a time, much like a for loop would. unzip() worked on the original zip file only because the paths vector had a single element.
unzip(paths, list = TRUE)$Name
Error in unzip(paths, list = TRUE) : invalid zip name argument
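To see why map() behaves like a loop, here is a minimal sketch that runs the same single-element function both ways. The file names and the count_chars() helper are made up for illustration; nchar() simply stands in for unzip().

```r
library(purrr)

# Hypothetical stand-ins for the real zip paths
toy_paths <- c("NFS_1974.zip", "NFS_1975.zip", "NFS_1976.zip")

# A function that handles ONE element (a stand-in for unzip())
count_chars <- function(path) nchar(path)

# Explicit loop: visit each element and store the result
loop_result <- vector("list", length(toy_paths))
for (i in seq_along(toy_paths)) {
  loop_result[[i]] <- count_chars(toy_paths[[i]])
}

# map() does the same bookkeeping for us and also returns a list
map_result <- map(toy_paths, ~ count_chars(.x))

# Compare the two results
identical(loop_result, map_result)
```

The loop needs three lines of bookkeeping that map() hides, which is the whole appeal.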
The basic pattern
Use a function that applies to a single element
In this case the first element is a zip file containing text files, and we want to see the file names inside every zip.
x <- paths[[1]]
unzip(x, list = TRUE)$Name
Generalize it
Add a tilde (~) in front and change the first argument, x, to a pronoun by adding a dot in front of it, like this: .x. In this way it can step through the entire vector of file names.
~ unzip(.x, list = TRUE)$Name
Map the paths to the unzip() function
In English the code below reads: Show the names inside each zip file referred to in the paths vector.
file_names <- map(paths, ~ unzip(.x, list = TRUE)$Name)
file_names[[1]]
## [1] "1974 household data.txt"
## [2] "1974 mealsout.txt"
## [3] "1974 nutrient conversion factors data.txt"
## [4] "1974 visitor data.txt"
## [5] "1974 diary data.txt"
## [6] "1974 household by pregnancy or under twos.txt"
It returned a list of 27 elements representing the zip files. The individual elements contain the file names in the zip.
There is a file called mealsout which is what we are looking for. This is a good sign because we can gather only those files and ignore the rest for our analysis.
Put this pattern to use
Create a new dataframe that pairs each source zip with the files it contains.
Before we actually unzip these we want to create a reference to the source files because bad things happen when you don’t.
First, do it for one element
x <- paths[[1]]
tibble(path = x, files = unzip(x, list = TRUE)$Name)
Then generalize it
~ tibble(path = .x, files = unzip(.x, list = TRUE)$Name)
Then apply it
Since we know that we want a dataframe, we can use the dataframe-specific map_dfr() function, which row-binds the results so that we don’t have to convert a list to a dataframe ourselves.
# Apply the recipe to every zip and row-bind the results
source_files <- map_dfr(paths, ~ tibble(path = .x, files = unzip(.x, list = TRUE)$Name))
source_files
## # A tibble: 162 x 2
## path files
## <chr> <chr>
## 1 nfs_zip_files/NFS_1974.zip 1974 household data.txt
## 2 nfs_zip_files/NFS_1974.zip 1974 mealsout.txt
## 3 nfs_zip_files/NFS_1974.zip 1974 nutrient conversion factors data.txt
## 4 nfs_zip_files/NFS_1974.zip 1974 visitor data.txt
## 5 nfs_zip_files/NFS_1974.zip 1974 diary data.txt
## 6 nfs_zip_files/NFS_1974.zip 1974 household by pregnancy or under twos.txt
## 7 nfs_zip_files/NFS_1975.zip 1975 mealsout.txt
## 8 nfs_zip_files/NFS_1975.zip 1975 nutrient conversion factors data.txt
## 9 nfs_zip_files/NFS_1975.zip 1975 visitor data.txt
## 10 nfs_zip_files/NFS_1975.zip 1975 diary data.txt
## # … with 152 more rows
Progress. Now we have a dataframe of the source zip files and the text files they contain. There are 162 files in total.
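A side note for readers on newer package versions: as far as I know, purrr 1.0.0 marked map_dfr() as superseded in favor of map() followed by list_rbind(). The older function still works, but the sketch below shows the newer spelling. It assumes purrr 1.0.0 or later, and the toy listing is hypothetical data standing in for the real unzip() calls.

```r
library(purrr)
library(tibble)

# Hypothetical listing: zip path -> files inside it (stands in for unzip(list = TRUE))
toy_listing <- list(
  "nfs_zip_files/NFS_1974.zip" = c("1974 household data.txt", "1974 mealsout.txt"),
  "nfs_zip_files/NFS_1975.zip" = c("1975 household data.txt", "1975 mealsout.txt")
)

# Step 1: one small tibble per zip
per_zip <- map(names(toy_listing), ~ tibble(path = .x, files = toy_listing[[.x]]))

# Step 2: row-bind the list of tibbles into a single dataframe
source_files_toy <- list_rbind(per_zip)
source_files_toy
```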
How do we know if each year has the same file names?
Very consistent: every file type appears 27 times, once per year. The only exception is the nutrient conversion factors file, which goes by two slightly different names (1 + 26) that still add up to 27.
source_files %>%
  extract(files, c("year", "name"), "(\\d{4}) (.*)\\.txt") %>%
  count(name)
## # A tibble: 7 x 2
## name n
## <chr> <int>
## 1 diary data 27
## 2 household by pregnancy or under twos 27
## 3 household data 27
## 4 mealsout 27
## 5 nutrient conversion factors 1
## 6 nutrient conversion factors data 26
## 7 visitor data 27
We are only concerned with meals out and it has the maximum of 27 files. The other files can be ignored.
unzip() can extract selected members through its files argument, but the simplest route here is to unzip everything and pick out the meals-out files afterwards, so let’s do that.
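For the curious, here is a minimal sketch of the selective route using the files argument of unzip(). It builds its own throwaway zip so it can run anywhere; the file names are invented, and it assumes a command-line zip program is available for utils::zip() to call.

```r
# Build a throwaway zip so the sketch is self-contained.
# Assumption: a command-line zip program is available to utils::zip().
work_dir <- tempfile("nfs_demo_")
dir.create(work_dir)
demo_files <- file.path(work_dir, c("1974_mealsout.txt", "1974_visitor_data.txt"))
writeLines("hhno\tlogday\tmlso", demo_files[1])
writeLines("hhno\tvisitors", demo_files[2])
zip_path <- file.path(work_dir, "NFS_1974.zip")
zip(zip_path, files = demo_files, flags = "-jq")  # -j drops folder paths, -q is quiet

# List the members, keep only the ones we want, and extract just those
members <- unzip(zip_path, list = TRUE)$Name
wanted <- grep("mealsout", members, value = TRUE)
out_dir <- file.path(work_dir, "only_mealsout")
unzip(zip_path, files = wanted, exdir = out_dir)

list.files(out_dir)
```

The same idea would scale to all 27 zips with walk(), at the cost of one extra listing call per archive.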
Unzip to a new folder
First, let me show you that the destination folder is empty (aka nothing up my sleeve)
dir_ls("nfs_source_files")
## character(0)
Extract the first zip to the destination folder
x <- paths[[1]]
unzip(x, exdir = "nfs_source_files")
dir_ls("nfs_source_files")
## nfs_source_files/1974 diary data.txt
## nfs_source_files/1974 household by pregnancy or under twos.txt
## nfs_source_files/1974 household data.txt
## nfs_source_files/1974 mealsout.txt
## nfs_source_files/1974 nutrient conversion factors data.txt
## nfs_source_files/1974 visitor data.txt
Success!! Now we can generalize it.
Extract all files to destination folder
Note that walk() works like map() except that it is called for its side effects: it returns its input invisibly, so nothing is printed to the console.
walk(paths, ~ unzip(.x, exdir = "nfs_source_files"))
sample(dir("nfs_source_files"), 10, replace = FALSE) # Get 10 random filenames
## [1] "1984 diary data.txt"
## [2] "1982 nutrient conversion factors data.txt"
## [3] "1979 mealsout.txt"
## [4] "1989 diary data.txt"
## [5] "1999 mealsout.txt"
## [6] "1983 diary data.txt"
## [7] "1977 household data.txt"
## [8] "1977 diary data.txt"
## [9] "1975 nutrient conversion factors data.txt"
## [10] "1997 mealsout.txt"
All files look to be extracted.
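A small sketch of the difference between map() and walk(), using made-up file names and message() as a stand-in for the side effect of unzipping:

```r
library(purrr)

# Hypothetical file names
toy_paths <- c("NFS_1974.zip", "NFS_1975.zip")

# map() collects and returns whatever the function returns
mapped <- map(toy_paths, ~ toupper(.x))

# walk() runs the function for its side effect (here, a message)
# and hands back its input unchanged, invisibly
walked <- walk(toy_paths, ~ message("processing ", .x))

identical(walked, toy_paths)
```

Because walk() returns its input, it can sit in the middle of a pipeline without disturbing what flows through.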
Get only the meals-out files
We’ll use a regular expression to identify meals out files.
Anyone who has worked with zip files and regular expressions in the past will probably have their blood pressure go up a little with this approach. So many things can go wrong. What evs, full speed ahead!!!
Get the paths of the meals-out files
mealsout_paths <- dir_ls("nfs_source_files", regexp = "mealsout")
mealsout_paths
## nfs_source_files/1974 mealsout.txt nfs_source_files/1975 mealsout.txt
## nfs_source_files/1976 mealsout.txt nfs_source_files/1977 mealsout.txt
## nfs_source_files/1978 mealsout.txt nfs_source_files/1979 mealsout.txt
## nfs_source_files/1980 mealsout.txt nfs_source_files/1981 mealsout.txt
## nfs_source_files/1982 mealsout.txt nfs_source_files/1983 mealsout.txt
## nfs_source_files/1984 mealsout.txt nfs_source_files/1985 mealsout.txt
## nfs_source_files/1986 mealsout.txt nfs_source_files/1987 mealsout.txt
## nfs_source_files/1988 mealsout.txt nfs_source_files/1989 mealsout.txt
## nfs_source_files/1990 mealsout.txt nfs_source_files/1991 mealsout.txt
## nfs_source_files/1992 mealsout.txt nfs_source_files/1993 mealsout.txt
## nfs_source_files/1994 mealsout.txt nfs_source_files/1995 mealsout.txt
## nfs_source_files/1996 mealsout.txt nfs_source_files/1997 mealsout.txt
## nfs_source_files/1998 mealsout.txt nfs_source_files/1999 mealsout.txt
## nfs_source_files/2000 mealsout.txt
Looks correct. There is one file for every year from 1974 through 2000, and the last year we have is 2000.
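The bare pattern "mealsout" worked here, but it would also match any stray file or folder with that word in its name. As a hedged illustration, the sketch below compares it with a stricter, anchored pattern. The last two file names are invented purely to show a false match.

```r
# Hypothetical folder listing; the last two names are invented to show a false match
toy_files <- c(
  "nfs_source_files/1974 mealsout.txt",
  "nfs_source_files/1975 mealsout.txt",
  "nfs_source_files/1975 mealsout notes.txt",
  "nfs_source_files/mealsout_backup/readme.txt"
)

# Loose pattern: matches anything containing "mealsout"
loose <- grepl("mealsout", toy_files)

# Stricter pattern: four-digit year, a space, "mealsout.txt", then end of string
strict <- grepl("/[0-9]{4} mealsout\\.txt$", toy_files)

data.frame(file = basename(toy_files), loose, strict)
```

The stricter pattern could be passed to the regexp argument of dir_ls() in the same way as the loose one.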
Show the first few lines of the first file
Hoping there is a header on row 1 and the rest is data.
# What's the first few lines from first file?
x <- mealsout_paths[[1]]
read_lines(x, n_max = 6)
## [1] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [2] "20002\t1\t0\t0\t0\t0\t0\t1"
## [3] "20002\t3\t0\t0\t0\t0\t0\t1"
## [4] "20002\t4\t1\t0\t0\t0\t2\t2"
## [5] "20002\t5\t1\t0\t0\t0\t3\t4"
## [6] "20002\t6\t1\t0\t0\t0\t2\t2"
The lucky stars are shining upon us. The first row is the header and the rest are numeric data and possibly all integers. We can also tell that it is a tab-delimited file.
Ensure that all headers are the same
Trust but Verify
# A character vector would be simpler
# The `unname()` function does exactly what it says and removes names attribute
unname(map_chr(mealsout_paths, ~ read_lines(.x, n_max = 1)))
## [1] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [2] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [3] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [4] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [5] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [6] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [7] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [8] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [9] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [10] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [11] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [12] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [13] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [14] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [15] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [16] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [17] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [18] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [19] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [20] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [21] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [22] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [23] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [24] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [25] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [26] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [27] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
… or …
table(map_chr(mealsout_paths, ~ read_lines(.x, n_max = 1)))
##
## hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso
## 27
PHEW! All file formats are the same.
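Eyeballing 27 identical lines works, but the same check can be made to fail loudly if a header ever differs. A minimal sketch, using two tiny made-up files in a temporary folder:

```r
library(purrr)
library(readr)

# Write two tiny tab-delimited files with the same header (hypothetical data)
toy_dir <- tempfile("headers_")
dir.create(toy_dir)
toy_paths <- file.path(toy_dir, c("1974 mealsout.txt", "1975 mealsout.txt"))
writeLines(c("hhno\tlogday\tmlso", "20002\t1\t1"), toy_paths[1])
writeLines(c("hhno\tlogday\tmlso", "30001\t2\t0"), toy_paths[2])

# Read only the first line of each file
headers <- map_chr(toy_paths, ~ read_lines(.x, n_max = 1))

# Halt with an error if more than one distinct header turns up
stopifnot(length(unique(headers)) == 1)
unique(headers)
```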
How do I read in the data for the files we want?
We know it’s tab-delimited so read_tsv() is going to be part of our base function
Bring in only integer columns
This pattern is not mandatory for what we are doing, but it is a handy one and I’m sure you can see other ways to apply it.
Create a recipe for integer columns
col_spec means a specification for the column types you want.
column_specification <- cols(.default = col_integer())
class(column_specification)
## [1] "col_spec"
Apply col_spec while reading in file
The col_types argument can accept a col_spec object.
View the data
We haven’t assigned the data to an object yet.
read_tsv(x, col_types = column_specification)
## # A tibble: 24,144 x 8
## hhno logday schml pkdl othl mlwhl midml mlso
## <int> <int> <int> <int> <int> <int> <int> <int>
## 1 20002 1 0 0 0 0 0 1
## 2 20002 3 0 0 0 0 0 1
## 3 20002 4 1 0 0 0 2 2
## 4 20002 5 1 0 0 0 3 4
## 5 20002 6 1 0 0 0 2 2
## 6 20002 7 1 0 0 0 2 2
## 7 20004 1 0 0 0 0 1 1
## 8 20004 2 0 0 1 0 2 4
## 9 20004 4 0 1 0 0 1 1
## 10 20004 5 0 1 0 0 1 1
## # … with 24,134 more rows
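To see what the column specification buys us, here is a small sketch that reads the same inline text with and without it. It assumes readr 2.0.0 or later, where I() marks a string as literal data; the three data lines are made up.

```r
library(readr)

# Inline tab-delimited text; I() tells readr (2.0.0 or later) it is literal data
toy_tsv <- I("hhno\tlogday\tmlso\n20002\t1\t1\n20002\t3\t1\n")

int_cols <- cols(.default = col_integer())

# With the spec: every column is parsed as integer
with_spec <- read_tsv(toy_tsv, col_types = int_cols)

# Without it: readr guesses, and whole numbers come back as doubles
guessed <- read_tsv(toy_tsv, show_col_types = FALSE)

sapply(with_spec, class)
sapply(guessed, class)
```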
Import all files into a single dataframe
This will use the column specification but since all columns are integers there won’t be a change in your data.
Load the dataframe
meals_out <- map_dfr(mealsout_paths, ~ read_tsv(.x, col_types = column_specification))
sample_n(meals_out, 6, replace = FALSE)
## # A tibble: 6 x 8
## hhno logday schml pkdl othl mlwhl midml mlso
## <int> <int> <int> <int> <int> <int> <int> <int>
## 1 200081 6 0 0 0 0 2 2
## 2 251171 3 0 0 0 0 0 2
## 3 63445 1 0 0 0 0 1 1
## 4 67008 2 0 0 0 0 0 2
## 5 95533 3 0 0 0 0 1 1
## 6 229101 4 0 0 0 0 1 1
Extract year from mealsout_paths into new column
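Note that this is a complete replacement for several of the steps shown above: vroom() reads every file in mealsout_paths into one dataframe and records each row’s source file in the column named by id. It guesses the column types, which is why the columns below are doubles rather than integers.
meals <- vroom::vroom(mealsout_paths, id = "path")
meals <- meals %>% extract(path, "year", "(\\d{4})", convert = TRUE)
meals
## # A tibble: 807,068 x 9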
## year hhno logday schml pkdl othl mlwhl midml mlso
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1974 20002 1 0 0 0 0 0 1
## 2 1974 20002 3 0 0 0 0 0 1
## 3 1974 20002 4 1 0 0 0 2 2
## 4 1974 20002 5 1 0 0 0 3 4
## 5 1974 20002 6 1 0 0 0 2 2
## 6 1974 20002 7 1 0 0 0 2 2
## 7 1974 20004 1 0 0 0 0 1 1
## 8 1974 20004 2 0 0 1 0 2 4
## 9 1974 20004 4 0 1 0 0 1 1
## 10 1974 20004 5 0 1 0 0 1 1
## # … with 807,058 more rows
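The extract() step is easy to try in isolation. A minimal sketch with two made-up rows, shaped the way they might look straight after reading with id = "path":

```r
library(tibble)
library(tidyr)

# Hypothetical rows as they might look straight after reading with id = "path"
toy_meals <- tibble(
  path = c("nfs_source_files/1974 mealsout.txt",
           "nfs_source_files/2000 mealsout.txt"),
  mlso = c(1, 2)
)

# The first run of four digits in the path becomes the year;
# convert = TRUE turns the captured text into an integer
toy_years <- extract(toy_meals, path, "year", "(\\d{4})", convert = TRUE)
toy_years
```

One caveat with this pattern: if the folder name itself contained four consecutive digits, those would be captured instead of the year.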
Data Dictionary of meals dataframe
FIELD   DESCRIPTION
hhno    household number
logday  logday
schml   school meals provided
pkdl    number of packed lunches
othl    other lunches out
mlwhl   number of meals on wheels
midml   total number of midday meals out
mlso    total number of meals out
Since the last column mlso doesn’t equal the sum of the other columns we’ll assume that it refers to evening meals out.
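The claim that mlso is not the sum of the other columns can be checked directly. This sketch uses five rows copied from the 1974 output shown earlier; the sum_of_others and matches_sum column names are my own.

```r
library(tibble)
library(dplyr)

# Five rows copied from the 1974 output shown earlier
toy_rows <- tribble(
  ~schml, ~pkdl, ~othl, ~mlwhl, ~midml, ~mlso,
  0, 0, 0, 0, 0, 1,
  1, 0, 0, 0, 2, 2,
  1, 0, 0, 0, 3, 4,
  0, 0, 0, 0, 1, 1,
  0, 0, 1, 0, 2, 4
)

# Add the row-wise sum of the other columns and compare it with mlso
checked <- toy_rows %>%
  mutate(
    sum_of_others = schml + pkdl + othl + mlwhl + midml,
    matches_sum   = mlso == sum_of_others
  )
checked
```

On the full dataset, count(checked, matches_sum) would show how often the two agree.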
Drop unneeded columns and convert to long table
meals_skinny <- meals %>%
  select(-hhno, -logday) %>%
  pivot_longer(cols = schml:mlso, names_to = "meal_type",
               values_to = "meals_out")
meals_skinny$meal_type <- str_replace(meals_skinny$meal_type, "schml", "School Lunch")
meals_skinny$meal_type <- str_replace(meals_skinny$meal_type, "pkdl", "Packed Lunch")
meals_skinny$meal_type <- str_replace(meals_skinny$meal_type, "othl", "Other Lunch")
meals_skinny$meal_type <- str_replace(meals_skinny$meal_type, "mlwhl", "Meals on Wheels")
meals_skinny$meal_type <- str_replace(meals_skinny$meal_type, "midml", "Midday Meal")
meals_skinny$meal_type <- str_replace(meals_skinny$meal_type, "mlso", "Evening Meal")
meals_skinny$meal_type <- factor(meals_skinny$meal_type)
glimpse(meals_skinny)
## Observations: 4,842,408
## Variables: 3
## $ year <int> 1974, 1974, 1974, 1974, 1974, 1974, 1974, 1974, 1974, 1974,…
## $ meal_type <fct> School Lunch, Packed Lunch, Other Lunch, Meals on Wheels, M…
## $ meals_out <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 2, 2, 1, 0,…
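The six str_replace() calls do the job. As an alternative sketch, not what the code above does, the same relabelling can be written as a single lookup against a named vector. The toy_codes vector is made up to stand in for the meal_type column.

```r
# Lookup table: column code -> display label (labels copied from the text above)
meal_labels <- c(
  schml = "School Lunch",
  pkdl  = "Packed Lunch",
  othl  = "Other Lunch",
  mlwhl = "Meals on Wheels",
  midml = "Midday Meal",
  mlso  = "Evening Meal"
)

# Hypothetical column of codes, as pivot_longer() would produce
toy_codes <- c("schml", "mlso", "pkdl", "mlso")

# Index the lookup by code, then drop the names it carries along
toy_labels <- factor(unname(meal_labels[toy_codes]))
toy_labels
```

Keeping the labels in one vector means a new meal type is a one-line change.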
Plot all types of meals across years
meals_skinny %>%
  group_by(year, meal_type) %>%
  summarise(avg_meals_out = mean(meals_out)) %>%
  ggplot(aes(x = year, y = avg_meals_out, color = meal_type)) +
  geom_line()
A little noisy.
Plot them individually
meals_skinny %>%
  group_by(year, meal_type) %>%
  summarise(avg_meals_out = mean(meals_out)) %>%
  ungroup() %>%
  ggplot(aes(x = year, y = avg_meals_out)) +
  geom_line() +
  facet_wrap(~meal_type, nrow = 3, ncol = 2)
Meals on Wheels, Other Lunch and Packed Lunch are not trending, so we can remove them to highlight the others.
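A note on the summarise() and ungroup() pair used in these plots: after grouping by two columns, summarise() by default leaves the result grouped by the first one. Assuming dplyr 1.0.0 or later, the .groups argument does the ungrouping in the same step. A minimal sketch on made-up data:

```r
library(dplyr)
library(tibble)

# Hypothetical long-format data in the same shape as meals_skinny
toy_long <- tibble(
  year      = c(1974L, 1974L, 1974L, 1975L, 1975L, 1975L),
  meal_type = c("Midday Meal", "Midday Meal", "Evening Meal",
                "Midday Meal", "Evening Meal", "Evening Meal"),
  meals_out = c(1, 3, 2, 2, 0, 4)
)

# .groups = "drop" returns an ungrouped result in one step
toy_avg <- toy_long %>%
  group_by(year, meal_type) %>%
  summarise(avg_meals_out = mean(meals_out), .groups = "drop")

toy_avg
```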
Try stacking the plots to see how it looks
`%notin%` <- Negate(`%in%`)
meals_skinny %>%
  filter(meal_type %notin% c("Meals on Wheels", "Other Lunch", "Packed Lunch")) %>%
  group_by(year, meal_type) %>%
  summarise(avg_meals_out = mean(meals_out)) %>%
  ungroup() %>%
  ggplot(aes(x = year, y = avg_meals_out)) +
  geom_line() +
  facet_wrap(~meal_type, nrow = 3)
Yuck!!
Try original plot with three types
meals_skinny %>%
  filter(meal_type %notin% c("Meals on Wheels", "Other Lunch", "Packed Lunch")) %>%
  group_by(year, meal_type) %>%
  summarise(avg_meals_out = mean(meals_out)) %>%
  ungroup() %>%
  ggplot(aes(x = year, y = avg_meals_out, color = meal_type)) +
  geom_line() +
  ggtitle("Sharp drop in meals eaten out since 1995") +
  labs(x = "Year", y = "Average meals out") +
  theme_minimal()
Much better.
Act 3 - Finale
BOSS: O’Neil!! Where is my pdf?
Aside: The Boss likes his charts in pdf form for some reason. ggplot has a function for that.
Use ggsave() to save the last plot
There are many options but we want a pdf so all I do is specify the path and filename.
ggsave("xyz/MealsOut.pdf")
ME: It’s in the xyz folder. Take a look and let me know how you want it dressed up. I’m heading out for a three-martini lunch.
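A footnote on ggsave(): called with only a file name, it saves the last plot displayed at the size of the current graphics device. The sketch below names the plot and the page size explicitly, which is a little more predictable in a script. The data, the 8 by 5 inch size and the temporary output path are all placeholders.

```r
library(ggplot2)

# Hypothetical yearly averages, only to have something to draw
toy_avg <- data.frame(year = 1996:2000, avg_meals_out = c(1.2, 1.1, 1.0, 0.9, 0.8))

p <- ggplot(toy_avg, aes(x = year, y = avg_meals_out)) +
  geom_line()

# Name the plot explicitly and fix the page size instead of relying on defaults
out_file <- file.path(tempdir(), "MealsOut_demo.pdf")
ggsave(out_file, plot = p, width = 8, height = 5, units = "in")

file.exists(out_file)
```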
Closing
If this situation hasn’t happened to you yet, it will. In many cases it’s easier to open the zip files and manually select the data files you need. In this case we would have had to do that 27 times. This pattern, which Hadley Wickham introduces in The Joy of Functional Programming, is fast, efficient, and understandable. It also opens up a new way of thinking about working with complex data structures in R.
Let’s say that two years have passed and we get a new zip file. The short script below is all that is needed to produce the graph from the original zip file. Our future selves will thank us.
library(tidyverse)
library(fs)
# Keep only zip files: the xyz folder now also holds MealsOut.pdf
paths <- dir_ls("xyz/", glob = "*.zip")
unzip(paths[[1]], overwrite = TRUE, exdir = "nfs_zip_files/")
paths <- dir_ls("nfs_zip_files/")
paths <- setdiff(paths, "nfs_zip_files/NFS_ReferenceCodesDescriptors.zip")
walk(paths, ~ unzip(.x, exdir = "nfs_source_files"))
mealsout_paths <- dir_ls("nfs_source_files", regexp = "mealsout")
meals <- vroom::vroom(mealsout_paths, id = "path")
meals <- meals %>% extract(path, "year", "(\\d{4})", convert = TRUE)
meals_long <- meals %>%
  select(-hhno, -logday) %>%
  pivot_longer(cols = schml:mlso, names_to = "meal_type",
               values_to = "meals_out")
meals_long$meal_type <- str_replace(meals_long$meal_type, "schml", "School Lunch")
meals_long$meal_type <- str_replace(meals_long$meal_type, "pkdl", "Packed Lunch")
meals_long$meal_type <- str_replace(meals_long$meal_type, "othl", "Other Lunch")
meals_long$meal_type <- str_replace(meals_long$meal_type, "mlwhl", "Meals on Wheels")
meals_long$meal_type <- str_replace(meals_long$meal_type, "midml", "Midday Meal")
meals_long$meal_type <- str_replace(meals_long$meal_type, "mlso", "Evening Meal")
meals_long$meal_type <- factor(meals_long$meal_type)
`%notin%` <- Negate(`%in%`)
meals_long %>%
  filter(meal_type %notin% c("Meals on Wheels", "Other Lunch", "Packed Lunch")) %>%
  group_by(year, meal_type) %>%
  summarise(avg_meals_out = mean(meals_out)) %>%
  ungroup() %>%
  ggplot(aes(x = year, y = avg_meals_out, color = meal_type)) +
  geom_line() +
  ggtitle("Sharp drop in meals eaten out since 1995") +
  labs(x = "Year", y = "Average meals out") +
  theme_minimal()
ggsave("xyz/MealsOut.pdf")