Summary

The body of this paper is written as a short first-person skit set in a London newsroom in early 2001. I am a young data analyst/visualizer for the newspaper.
I am handed a seemingly simple assignment that turns out to be trickier than I expected.
I start with a single zip file that contains 28 other zip files: 27 yearly archives, each holding several files of which I need only one, plus a reference archive.
Over the course of this paper you will learn how to pull out only the files you are interested in while keeping a reference to the source files.
In the end we will create a simple chart from the final data and save it to a pdf file.

Data sources

https://www.gov.uk/government/statistics/family-food-open-data
https://data.gov.uk/dataset/5c1a7a5d-4dd5-4b1b-84f2-3ba8883a07ca/family-food-open-data
https://webarchive.nationalarchives.gov.uk/20130103024837/http://www.defra.gov.uk/statistics/foodfarm/food/familyfood/nationalfoodsurvey/

Inspired by Hadley Wickham’s The Joy of Functional Programming

Load libraries and empty folders

library(tidyverse)
library(fs)
# Start from empty working folders.
# dir_delete() errors on a missing folder, so only delete folders that exist.
work_dirs <- c("nfs_zip_files", "nfs_source_files", "nfs_meals_out")
walk(work_dirs, ~ if (dir_exists(.x)) dir_delete(.x))
dir_create(work_dirs)

Act 1

Setting: UK Newsroom in early 2001. I work in their infographics and visualization dept.

Aside: 2001 is like ancient history. I imagine that the news room looked something like this.

BOSS: O’Neil!! I’ve got a new assignment for you. The 2000 National Food Survey data just dropped and I need a chart of how the number of meals that people eat out per week has changed over time. The National Food Survey data is in the xyz folder. Get on it. Pronto!!

ME: Yes sir. Right away sir.

Aside: My Boss likes to call people by their last names and thinks this is a slam dunk. We supposedly have the data and he only needs a simple line graph. What could go wrong?

Act 2

Setting: Me toiling at my desk

What’s in the xyz folder

paths <- dir_ls("xyz/")
paths

## xyz/NFSopen_AllData.zip

Oh wonderful, a zip file.
Zip files are the “Mystery Box” of data. You never know what’s inside. It could be a well-formatted csv or it could be anything else.

What’s in the zip file

There is only one file, but it’s good to get into the practice of looking at the first element explicitly because it becomes part of a pattern later.
In the unzip() function, list = TRUE only lists the contents of the archive and does not actually unzip the files to a folder.

x <- paths[[1]]
unzip(x, list = TRUE)

##                                 Name  Length                Date
## 1                       NFS_1992.zip 2464072 2016-01-29 14:54:00
## 2                       NFS_1993.zip 2750096 2016-01-29 15:03:00
## 3                       NFS_1994.zip 2661643 2016-02-02 14:42:00
## 4                       NFS_1995.zip 2765155 2016-02-02 14:42:00
## 5                       NFS_1996.zip 3048139 2016-02-02 14:43:00
## 6                       NFS_1997.zip 2374082 2016-02-02 14:43:00
## 7                       NFS_1998.zip 2280058 2016-02-02 14:43:00
## 8                       NFS_1999.zip 2275573 2016-02-02 14:44:00
## 9                       NFS_2000.zip 2305816 2016-02-02 14:44:00
## 10 NFS_ReferenceCodesDescriptors.zip 3971143 2016-02-17 12:14:00
## 11                      NFS_1974.zip 2463461 2016-01-28 19:01:00
## 12                      NFS_1975.zip 2571030 2016-01-28 18:16:00
## 13                      NFS_1976.zip 2603004 2016-01-28 19:10:00
## 14                      NFS_1977.zip 2690012 2016-01-28 19:20:00
## 15                      NFS_1978.zip 2596621 2016-01-28 19:28:00
## 16                      NFS_1979.zip 2434235 2016-01-29 09:43:00
## 17                      NFS_1980.zip 2678547 2016-01-29 10:00:00
## 18                      NFS_1981.zip 2592301 2016-02-01 09:04:00
## 19                      NFS_1982.zip 2649739 2016-01-29 10:52:00
## 20                      NFS_1983.zip 2399506 2016-01-29 10:57:00
## 21                      NFS_1984.zip 2322480 2016-01-29 11:33:00
## 22                      NFS_1985.zip 2339160 2016-01-29 13:02:00
## 23                      NFS_1986.zip 2320710 2016-01-29 13:10:00
## 24                      NFS_1987.zip 2387542 2016-01-29 13:21:00
## 25                      NFS_1988.zip 2378935 2016-01-29 13:28:00
## 26                      NFS_1989.zip 2487599 2016-01-29 14:31:00
## 27                      NFS_1990.zip 2233665 2016-01-29 14:37:00
## 28                      NFS_1991.zip 2143935 2016-01-29 14:44:00

Ughh!!! More zip files. It looks like one for every year.

Useful trick: Only show file names

Most of the time you don’t want all the extra data like length and date. unzip(list = TRUE) returns a data frame, so use its Name column to condense the output.

unzip(x, list = TRUE)$Name

##  [1] "NFS_1992.zip"                      "NFS_1993.zip"                     
##  [3] "NFS_1994.zip"                      "NFS_1995.zip"                     
##  [5] "NFS_1996.zip"                      "NFS_1997.zip"                     
##  [7] "NFS_1998.zip"                      "NFS_1999.zip"                     
##  [9] "NFS_2000.zip"                      "NFS_ReferenceCodesDescriptors.zip"
## [11] "NFS_1974.zip"                      "NFS_1975.zip"                     
## [13] "NFS_1976.zip"                      "NFS_1977.zip"                     
## [15] "NFS_1978.zip"                      "NFS_1979.zip"                     
## [17] "NFS_1980.zip"                      "NFS_1981.zip"                     
## [19] "NFS_1982.zip"                      "NFS_1983.zip"                     
## [21] "NFS_1984.zip"                      "NFS_1985.zip"                     
## [23] "NFS_1986.zip"                      "NFS_1987.zip"                     
## [25] "NFS_1988.zip"                      "NFS_1989.zip"                     
## [27] "NFS_1990.zip"                      "NFS_1991.zip"

Unzip to another folder

Specify the destination folder with exdir =. The default for list = is FALSE so we won’t need to include that to unzip.

unzip(x, overwrite = TRUE, exdir = "nfs_zip_files/")
dir_ls("nfs_zip_files/")

## nfs_zip_files/NFS_1974.zip
## nfs_zip_files/NFS_1975.zip
## nfs_zip_files/NFS_1976.zip
## nfs_zip_files/NFS_1977.zip
## nfs_zip_files/NFS_1978.zip
## nfs_zip_files/NFS_1979.zip
## nfs_zip_files/NFS_1980.zip
## nfs_zip_files/NFS_1981.zip
## nfs_zip_files/NFS_1982.zip
## nfs_zip_files/NFS_1983.zip
## nfs_zip_files/NFS_1984.zip
## nfs_zip_files/NFS_1985.zip
## nfs_zip_files/NFS_1986.zip
## nfs_zip_files/NFS_1987.zip
## nfs_zip_files/NFS_1988.zip
## nfs_zip_files/NFS_1989.zip
## nfs_zip_files/NFS_1990.zip
## nfs_zip_files/NFS_1991.zip
## nfs_zip_files/NFS_1992.zip
## nfs_zip_files/NFS_1993.zip
## nfs_zip_files/NFS_1994.zip
## nfs_zip_files/NFS_1995.zip
## nfs_zip_files/NFS_1996.zip
## nfs_zip_files/NFS_1997.zip
## nfs_zip_files/NFS_1998.zip
## nfs_zip_files/NFS_1999.zip
## nfs_zip_files/NFS_2000.zip
## nfs_zip_files/NFS_ReferenceCodesDescriptors.zip

Drop the odd file from the paths that we’ll use going forward

There is one file that looks like a data dictionary. We might need that later and can exclude it now for the purpose of data processing.

paths <- dir_ls("nfs_zip_files/")
paths <- setdiff(paths, "nfs_zip_files/NFS_ReferenceCodesDescriptors.zip")
paths

##  [1] "nfs_zip_files/NFS_1974.zip" "nfs_zip_files/NFS_1975.zip"
##  [3] "nfs_zip_files/NFS_1976.zip" "nfs_zip_files/NFS_1977.zip"
##  [5] "nfs_zip_files/NFS_1978.zip" "nfs_zip_files/NFS_1979.zip"
##  [7] "nfs_zip_files/NFS_1980.zip" "nfs_zip_files/NFS_1981.zip"
##  [9] "nfs_zip_files/NFS_1982.zip" "nfs_zip_files/NFS_1983.zip"
## [11] "nfs_zip_files/NFS_1984.zip" "nfs_zip_files/NFS_1985.zip"
## [13] "nfs_zip_files/NFS_1986.zip" "nfs_zip_files/NFS_1987.zip"
## [15] "nfs_zip_files/NFS_1988.zip" "nfs_zip_files/NFS_1989.zip"
## [17] "nfs_zip_files/NFS_1990.zip" "nfs_zip_files/NFS_1991.zip"
## [19] "nfs_zip_files/NFS_1992.zip" "nfs_zip_files/NFS_1993.zip"
## [21] "nfs_zip_files/NFS_1994.zip" "nfs_zip_files/NFS_1995.zip"
## [23] "nfs_zip_files/NFS_1996.zip" "nfs_zip_files/NFS_1997.zip"
## [25] "nfs_zip_files/NFS_1998.zip" "nfs_zip_files/NFS_1999.zip"
## [27] "nfs_zip_files/NFS_2000.zip"
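setdiff() works here because I know the exact name of the odd file. If more non-year files ever turn up, a pattern-based filter is one possible alternative. This is only a sketch: the vector below is a made-up stand-in for the real paths, and the regular expression assumes the NFS_<year>.zip naming seen above.

```r
# Hypothetical stand-in for the output of dir_ls("nfs_zip_files/")
zip_paths <- c(
  "nfs_zip_files/NFS_1974.zip",
  "nfs_zip_files/NFS_1975.zip",
  "nfs_zip_files/NFS_ReferenceCodesDescriptors.zip"
)

# Keep only the files named NFS_<four digits>.zip
year_zips <- zip_paths[grepl("NFS_\\d{4}\\.zip$", zip_paths)]
year_zips
```

The reference archive is dropped because its name has letters where the year should be.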

Look inside the first zip file

Hoping for no more zip files. Fingers crossed.

x <- paths[[1]]
unzip(x, list = TRUE)$Name

## [1] "1974 household data.txt"                      
## [2] "1974 mealsout.txt"                            
## [3] "1974 nutrient conversion factors data.txt"    
## [4] "1974 visitor data.txt"                        
## [5] "1974 diary data.txt"                          
## [6] "1974 household by pregnancy or under twos.txt"

Sigh of relief. This zip contains several txt files, which are probably the data, and I see that one of them is called “mealsout”. We are getting closer.

How do I unzip all these files into a single folder?

I’m going to take a minute to explain the next part because it’s the magic in the process.
The map() functions in the purrr package allow you to do bulk processing.

The pattern is:
1. Take a function that you can apply to one element
2. Generalize it
3. Apply it to all the elements

If you pass the whole paths vector to unzip() it fails with the error below, so you have to send it the filenames one at a time, much like a for loop. unzip() worked on the original zip file only because the paths vector had a single element. With more than one element it would have generated this error too.

unzip(paths, list = TRUE)$Name

Error in unzip(paths, list = TRUE) : invalid zip name argument
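The same one-at-a-time situation can be reproduced without any zip files. Here is a minimal sketch in which count_chars_one() is a made-up function that, like unzip(), insists on a single input.

```r
library(purrr)

# Made-up function that, like unzip(), only accepts one element at a time
count_chars_one <- function(x) {
  if (length(x) != 1) stop("expected a single element")
  nchar(x)
}

zips <- c("NFS_1974.zip", "NFS_1975.zip", "NFS_ReferenceCodesDescriptors.zip")

# Passing the whole vector fails, much like unzip(paths, list = TRUE)
whole_vector <- tryCatch(
  count_chars_one(zips),
  error = function(e) conditionMessage(e)
)

# map_int() hands the elements over one at a time
one_at_a_time <- map_int(zips, count_chars_one)
one_at_a_time
```

whole_vector ends up holding the error message, while one_at_a_time holds one result per filename.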

The basic pattern

Use a function that applies to a single element

In this case the first element is a zip file containing text files, and we want to look at the filenames inside every zip.

x <- paths[[1]]
unzip(x, list = TRUE)$Name

Generalize it

Add a tilde (~) in front and turn the first argument, x, into a pronoun by putting a dot in front of it: .x. Written this way, the recipe can be applied to each element of the filenames vector in turn.

 ~ unzip(.x, list = TRUE)$Name
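If the tilde shorthand looks odd, it may help to see it next to the longer spellings. A small sketch on made-up filenames, using substr() instead of unzip() so that it runs without any zip files:

```r
library(purrr)

zips <- c("NFS_1974.zip", "NFS_1975.zip")

# 1. A named function applied to each element
year_of <- function(x) substr(x, 5, 8)
by_name <- map_chr(zips, year_of)

# 2. An anonymous function written inline
by_anonymous <- map_chr(zips, function(x) substr(x, 5, 8))

# 3. The formula shorthand: ~ starts the recipe, .x is the current element
by_formula <- map_chr(zips, ~ substr(.x, 5, 8))

by_formula
```

All three calls return the same character vector. The formula version is just the most compact way to write it.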

Map the paths to the unzip() function

In English the code below reads: Show the names inside each zip file referred to in the paths vector.

file_names <- map(paths, ~ unzip(.x, list = TRUE)$Name)
file_names[[1]]

## [1] "1974 household data.txt"                      
## [2] "1974 mealsout.txt"                            
## [3] "1974 nutrient conversion factors data.txt"    
## [4] "1974 visitor data.txt"                        
## [5] "1974 diary data.txt"                          
## [6] "1974 household by pregnancy or under twos.txt"

It returned a list of 27 elements representing the zip files. The individual elements contain the file names in the zip.
There is a file called mealsout which is what we are looking for. This is a good sign because we can gather only those files and ignore the rest for our analysis.
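Eyeballing the first element is fine, but the same mapping idea can check every zip at once. A sketch on a made-up, two-element stand-in for the file_names list:

```r
library(purrr)

# Made-up stand-in for the file_names list returned by map()
toy_file_names <- list(
  "NFS_1974.zip" = c("1974 household data.txt", "1974 mealsout.txt"),
  "NFS_1975.zip" = c("1975 mealsout.txt", "1975 diary data.txt")
)

# For each zip, count the entries whose name contains "mealsout"
mealsout_per_zip <- map_int(toy_file_names, ~ sum(grepl("mealsout", .x)))
mealsout_per_zip

# TRUE when every zip holds exactly one mealsout file
all(mealsout_per_zip == 1)
```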

Put this pattern to use

Create a new dataframe that pairs each source zip with the files it contains.

Before we actually unzip these we want to create a reference to the source files because bad things happen when you don’t.

First, do it for one element

x <- paths[[1]]
tibble(path = x, files = unzip(x, list = TRUE)$Name)

Then generalize it

~ tibble(path = .x, files = unzip(.x, list = TRUE)$Name)

Then apply it

Since we know that we want a dataframe we can use the dataframe-specific map() function so that we don’t have to convert a list to a df.

# Then we make a recipe and use map()
source_files <- map_dfr(paths, ~ tibble(path = .x, files = unzip(.x, list = TRUE)$Name))
source_files

## # A tibble: 162 x 2
##    path                       files                                        
##    <chr>                      <chr>                                        
##  1 nfs_zip_files/NFS_1974.zip 1974 household data.txt                      
##  2 nfs_zip_files/NFS_1974.zip 1974 mealsout.txt                            
##  3 nfs_zip_files/NFS_1974.zip 1974 nutrient conversion factors data.txt    
##  4 nfs_zip_files/NFS_1974.zip 1974 visitor data.txt                        
##  5 nfs_zip_files/NFS_1974.zip 1974 diary data.txt                          
##  6 nfs_zip_files/NFS_1974.zip 1974 household by pregnancy or under twos.txt
##  7 nfs_zip_files/NFS_1975.zip 1975 mealsout.txt                            
##  8 nfs_zip_files/NFS_1975.zip 1975 nutrient conversion factors data.txt    
##  9 nfs_zip_files/NFS_1975.zip 1975 visitor data.txt                        
## 10 nfs_zip_files/NFS_1975.zip 1975 diary data.txt                          
## # … with 152 more rows

Progress! Now we have a dataframe of the source zip files and the source text files. There are a total of 162.
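As far as I know, map_dfr() is marked as superseded in purrr 1.0.0 and later, although it still works. A version-neutral way to get the same result is to map() to a list of tibbles and stack them with dplyr::bind_rows(). The sketch below uses list_names(), a made-up stand-in for unzip(.x, list = TRUE)$Name, so that it runs without the real zip files.

```r
library(purrr)
library(tibble)
library(dplyr)

# Made-up lookup standing in for the contents of two zip files
toy_contents <- list(
  "NFS_1974.zip" = c("1974 household data.txt", "1974 mealsout.txt"),
  "NFS_1975.zip" = c("1975 mealsout.txt")
)
list_names <- function(path) toy_contents[[path]]

toy_paths <- names(toy_contents)

# One small tibble per zip, then stack them into a single data frame
toy_source_files <- map(toy_paths, ~ tibble(path = .x, files = list_names(.x))) %>%
  bind_rows()
toy_source_files
```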

How do we know if each year has the same file names?

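Very consistent: each file type appears 27 times, once per year. The only exception is the nutrient conversion factors file, which goes by two slightly different names (26 + 1) that still add up to 27.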

source_files %>%
  extract(files, c("year", "name"), "(\\d{4}) (.*)\\.txt") %>%
  count(name)

## # A tibble: 7 x 2
##   name                                     n
##   <chr>                                <int>
## 1 diary data                              27
## 2 household by pregnancy or under twos    27
## 3 household data                          27
## 4 mealsout                                27
## 5 nutrient conversion factors              1
## 6 nutrient conversion factors data        26
## 7 visitor data                            27

We are only concerned with meals out and it has the maximum of 27 files. The other files can be ignored.
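If I ever did need the nutrient files, the same extract() call could tell me which year uses the odd name. A sketch on made-up rows (the year shown here is invented, not taken from the real data):

```r
library(tibble)
library(dplyr)

# Made-up rows; in the real data the odd name belongs to some other year
toy_source_files <- tibble(
  files = c(
    "1974 mealsout.txt",
    "1974 nutrient conversion factors data.txt",
    "1975 mealsout.txt",
    "1975 nutrient conversion factors.txt"
  )
)

# Split "<year> <name>.txt" into two columns and keep the odd name only
odd_one_out <- toy_source_files %>%
  tidyr::extract(files, c("year", "name"), "(\\d{4}) (.*)\\.txt", convert = TRUE) %>%
  filter(name == "nutrient conversion factors")
odd_one_out
```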

We could pull out just the meals out files with the files argument of unzip(), but unzipping everything first keeps the pattern simple and lets us inspect the other files later, so let’s do that.
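For the record, base R’s unzip() does have a files argument that limits extraction to named entries. Here is a hedged sketch of how that route might look. Both pick_mealsout() and extract_mealsout() are hypothetical helpers of my own, and only the name-picking part is run here because it needs no zip file.

```r
# Pick the entries we care about from a vector of names inside a zip
pick_mealsout <- function(entry_names) {
  entry_names[grepl("mealsout", entry_names, fixed = TRUE)]
}

# Hypothetical helper: extract only the mealsout entries from one zip
extract_mealsout <- function(zip_path, exdir) {
  wanted <- pick_mealsout(unzip(zip_path, list = TRUE)$Name)
  unzip(zip_path, files = wanted, exdir = exdir)
}

pick_mealsout(c("1974 household data.txt", "1974 mealsout.txt"))
```

Walking extract_mealsout() over the yearly zip paths, with a destination folder such as nfs_meals_out, would then extract one file per zip instead of six.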

Unzip to a new folder

First, let me show you that the destination folder is empty (aka nothing up my sleeve)

dir_ls("nfs_source_files")

## character(0)

Extract the first zip to the destination folder

x <- paths[[1]]
unzip(x, exdir = "nfs_source_files")
dir_ls("nfs_source_files")

## nfs_source_files/1974 diary data.txt
## nfs_source_files/1974 household by pregnancy or under twos.txt
## nfs_source_files/1974 household data.txt
## nfs_source_files/1974 mealsout.txt
## nfs_source_files/1974 nutrient conversion factors data.txt
## nfs_source_files/1974 visitor data.txt

Success!! Now we can generalize it.

Extract all files to destination folder

Note that walk() works like map() except that it is called for its side effects (here, writing files) and returns its input invisibly, so nothing is printed to the console.

walk(paths, ~ unzip(.x, exdir = "nfs_source_files"))
sample(dir("nfs_source_files"), 10, replace = FALSE) # Get 10 random filenames

##  [1] "1984 diary data.txt"                      
##  [2] "1982 nutrient conversion factors data.txt"
##  [3] "1979 mealsout.txt"                        
##  [4] "1989 diary data.txt"                      
##  [5] "1999 mealsout.txt"                        
##  [6] "1983 diary data.txt"                      
##  [7] "1977 household data.txt"                  
##  [8] "1977 diary data.txt"                      
##  [9] "1975 nutrient conversion factors data.txt"
## [10] "1997 mealsout.txt"

All files look to be extracted.
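To see what walk() does without touching the real data, here is a small sketch that writes three made-up files to a temporary folder. The folder name walk_demo and the file contents are my own placeholders.

```r
library(purrr)

out_dir <- file.path(tempdir(), "walk_demo")
dir.create(out_dir, showWarnings = FALSE)

years <- c("1974", "1975", "1976")

# walk() runs the side effect (writing a file) once per element and
# invisibly returns its input, so nothing is printed
returned <- walk(
  years,
  ~ writeLines("hhno\tlogday", file.path(out_dir, paste0(.x, " mealsout.txt")))
)

list.files(out_dir)
```

The files are on disk afterwards, and returned is simply the original years vector.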

Get only the meals-out files

We’ll use a regular expression to identify meals out files.
Anyone who has worked with zip files and regular expressions in the past will probably have their blood pressure go up a little with this approach. So many things can go wrong. What evs, full speed ahead!!!

Get the paths of the meals-out files

mealsout_paths <- dir_ls("nfs_source_files", regexp = "mealsout")
mealsout_paths

## nfs_source_files/1974 mealsout.txt nfs_source_files/1975 mealsout.txt 
## nfs_source_files/1976 mealsout.txt nfs_source_files/1977 mealsout.txt 
## nfs_source_files/1978 mealsout.txt nfs_source_files/1979 mealsout.txt 
## nfs_source_files/1980 mealsout.txt nfs_source_files/1981 mealsout.txt 
## nfs_source_files/1982 mealsout.txt nfs_source_files/1983 mealsout.txt 
## nfs_source_files/1984 mealsout.txt nfs_source_files/1985 mealsout.txt 
## nfs_source_files/1986 mealsout.txt nfs_source_files/1987 mealsout.txt 
## nfs_source_files/1988 mealsout.txt nfs_source_files/1989 mealsout.txt 
## nfs_source_files/1990 mealsout.txt nfs_source_files/1991 mealsout.txt 
## nfs_source_files/1992 mealsout.txt nfs_source_files/1993 mealsout.txt 
## nfs_source_files/1994 mealsout.txt nfs_source_files/1995 mealsout.txt 
## nfs_source_files/1996 mealsout.txt nfs_source_files/1997 mealsout.txt 
## nfs_source_files/1998 mealsout.txt nfs_source_files/1999 mealsout.txt 
## nfs_source_files/2000 mealsout.txt

Looks correct. There is one file for every year from 1974 through 2000, the last year we have.
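For the nervous, a stricter pattern is one way to lower the blood pressure. The sketch below compares the loose pattern with an anchored one on a few candidate paths. The third path is a made-up file that should not be picked up.

```r
candidates <- c(
  "nfs_source_files/1974 mealsout.txt",
  "nfs_source_files/1974 household data.txt",
  "nfs_source_files/mealsout notes.docx" # made-up file that should be rejected
)

# Loose: anything with "mealsout" somewhere in the path
loose <- grepl("mealsout", candidates)

# Strict: a four-digit year, a space, then exactly "mealsout.txt" at the end
strict <- grepl("/\\d{4} mealsout\\.txt$", candidates)

data.frame(candidates, loose, strict)
```

The strict pattern should also work as the regexp argument of dir_ls(), assuming the files keep the "<year> mealsout.txt" naming.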

Show the first few lines of the first file

Hoping there is a header on row 1 and the rest is data.

# What are the first few lines of the first file?
x <- mealsout_paths[[1]]
read_lines(x, n_max = 6)

## [1] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [2] "20002\t1\t0\t0\t0\t0\t0\t1"                         
## [3] "20002\t3\t0\t0\t0\t0\t0\t1"                         
## [4] "20002\t4\t1\t0\t0\t0\t2\t2"                         
## [5] "20002\t5\t1\t0\t0\t0\t3\t4"                         
## [6] "20002\t6\t1\t0\t0\t0\t2\t2"

The lucky stars are shining upon us. The first row is the header and the rest are numeric data and possibly all integers. We can also tell that it is a tab-delimited file.

Ensure that all headers are the same

Trust but Verify

# A character vector would be simpler
# The `unname()` function does exactly what it says and removes names attribute
unname(map_chr(mealsout_paths, ~ read_lines(.x, n_max = 1)))

##  [1] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
##  [2] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
##  [3] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
##  [4] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
##  [5] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
##  [6] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
##  [7] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
##  [8] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
##  [9] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [10] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [11] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [12] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [13] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [14] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [15] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [16] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [17] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [18] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [19] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [20] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [21] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [22] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [23] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [24] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [25] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [26] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"
## [27] "hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso"

… or …

table(map_chr(mealsout_paths, ~ read_lines(.x, n_max = 1)))

## 
## hhno\tlogday\tschml\tpkdl\tothl\tmlwhl\tmidml\tmlso 
##                                                  27

PHEW! All file formats are the same.
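Reading 27 identical lines by eye works, but in a script that runs unattended I would rather have the check fail loudly. A minimal sketch using two made-up files in a temporary folder:

```r
library(purrr)
library(readr)

# Two made-up files with the same header, written to a temporary folder
demo_dir <- file.path(tempdir(), "header_demo")
dir.create(demo_dir, showWarnings = FALSE)
demo_paths <- file.path(demo_dir, c("1974 mealsout.txt", "1975 mealsout.txt"))
walk(demo_paths, ~ write_lines(c("hhno\tlogday\tmlso", "20002\t1\t1"), .x))

# Read only the first line of each file
headers <- map_chr(demo_paths, ~ read_lines(.x, n_max = 1))

# Stop with an error if any file has a different header
stopifnot(length(unique(headers)) == 1)
unique(headers)
```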

How do I read in the data for the files we want?

We know it’s tab-delimited so read_tsv() is going to be part of our base function

Bring in only integer columns

This is an interesting pattern. It is not mandatory for what we are doing, but I’m sure you can see the different ways it can be applied.

Create a recipe for integer columns

col_spec means a specification for the column types you want.

column_specification <- cols(.default = col_integer())
class(column_specification)

## [1] "col_spec"

Apply col_spec while reading in file

The col_types argument can accept a col_spec object.

View the data

We haven’t assigned the data to an object yet.

read_tsv(x, col_types = column_specification)

## # A tibble: 24,144 x 8
##     hhno logday schml  pkdl  othl mlwhl midml  mlso
##    <int>  <int> <int> <int> <int> <int> <int> <int>
##  1 20002      1     0     0     0     0     0     1
##  2 20002      3     0     0     0     0     0     1
##  3 20002      4     1     0     0     0     2     2
##  4 20002      5     1     0     0     0     3     4
##  5 20002      6     1     0     0     0     2     2
##  6 20002      7     1     0     0     0     2     2
##  7 20004      1     0     0     0     0     1     1
##  8 20004      2     0     0     1     0     2     4
##  9 20004      4     0     1     0     0     1     1
## 10 20004      5     0     1     0     0     1     1
## # … with 24,134 more rows
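The default in a column specification can also be overridden for individual columns. As an illustration only (the analysis here does not need it), this sketch keeps hhno as text in a tiny made-up file while everything else stays integer:

```r
library(readr)

# Tiny made-up tab-delimited file
demo_file <- tempfile(fileext = ".txt")
write_lines(c("hhno\tlogday\tmlso", "20002\t1\t1", "20004\t2\t4"), demo_file)

# Every column defaults to integer, except hhno which is kept as text
demo_spec <- cols(.default = col_integer(), hhno = col_character())
demo <- read_tsv(demo_file, col_types = demo_spec)
demo
```

Treating an identifier such as a household number as text is a common choice, since nobody is going to do arithmetic on it.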

Import all files into a single dataframe

This will use the column specification but since all columns are integers there won’t be a change in your data.

Load the dataframe

meals_out <- map_dfr(mealsout_paths, ~ read_tsv(.x, col_types = column_specification))
sample_n(meals_out, 6, replace = FALSE)

## # A tibble: 6 x 8
##     hhno logday schml  pkdl  othl mlwhl midml  mlso
##    <int>  <int> <int> <int> <int> <int> <int> <int>
## 1 200081      6     0     0     0     0     2     2
## 2 251171      3     0     0     0     0     0     2
## 3  63445      1     0     0     0     0     1     1
## 4  67008      2     0     0     0     0     0     2
## 5  95533      3     0     0     0     0     1     1
## 6 229101      4     0     0     0     0     1     1
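Notice that the combined dataframe no longer says which file each row came from. One way to keep that reference is to name each path after itself and let bind_rows() turn the names into a column. A sketch on two made-up files:

```r
library(purrr)
library(readr)
library(dplyr)

# Two made-up files in a temporary folder
demo_dir <- file.path(tempdir(), "id_demo")
dir.create(demo_dir, showWarnings = FALSE)
demo_paths <- file.path(demo_dir, c("1974 mealsout.txt", "1975 mealsout.txt"))
walk(demo_paths, ~ write_lines(c("hhno\tmlso", "20002\t1"), .x))

# Name each path after itself so the name can travel with the data
named_paths <- set_names(demo_paths)

# Read every file, then stack them with the list names stored in "path"
demo_meals <- map(named_paths, ~ read_tsv(.x, col_types = cols(.default = col_integer()))) %>%
  bind_rows(.id = "path")
demo_meals
```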

Extract year from mealsout_paths into new column

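Note that vroom() can read all the files in one call and record each row’s source path in an id column, so this is a complete replacement for the map_dfr() and read_tsv() steps shown above.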

meals <- vroom::vroom(mealsout_paths, id = "path")
meals <- meals %>% extract(path, "year", "(\\d{4})", convert = TRUE)
meals

## # A tibble: 807,068 x 9
##     year  hhno logday schml  pkdl  othl mlwhl midml  mlso
##    <int> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  1974 20002      1     0     0     0     0     0     1
##  2  1974 20002      3     0     0     0     0     0     1
##  3  1974 20002      4     1     0     0     0     2     2
##  4  1974 20002      5     1     0     0     0     3     4
##  5  1974 20002      6     1     0     0     0     2     2
##  6  1974 20002      7     1     0     0     0     2     2
##  7  1974 20004      1     0     0     0     0     1     1
##  8  1974 20004      2     0     0     1     0     2     4
##  9  1974 20004      4     0     1     0     0     1     1
## 10  1974 20004      5     0     1     0     0     1     1
## # … with 807,058 more rows
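The "(\\d{4})" pattern grabs the first run of four digits in the whole path, which is fine as long as the folder name has no digits in it. A slightly more defensive sketch matches against the file name only. The second path below is made up to show the trap.

```r
library(stringr)

demo_paths <- c(
  "nfs_source_files/1974 mealsout.txt",
  "archive_2019/1975 mealsout.txt" # made-up folder name containing four digits
)

# Matching against the full path picks up the first run of four digits
from_full_path <- as.integer(str_extract(demo_paths, "\\d{4}"))

# Matching against the file name only avoids that trap
from_file_name <- as.integer(str_extract(basename(demo_paths), "^\\d{4}"))

from_full_path
from_file_name
```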

Data Dictionary of meals dataframe

FIELD DESCRIPTION

hhno household number

logday logday

schml school meals provided

pkdl number of packed lunches

othl other lunches out

mlwhl number of meals on wheels

midml total number of midday meals out

mlso total number of meals out

Since the last column mlso doesn’t equal the sum of the other columns we’ll assume that it refers to evening meals out.

Drop unneeded columns and convert to long table

meals_skinny <- meals %>%
  select(-hhno, -logday) %>%
  pivot_longer(cols = schml:mlso, names_to = "meal_type", 
               values_to = "meals_out") 
meals_skinny$meal_type <- str_replace(meals_skinny$meal_type, "schml", "School Lunch")
meals_skinny$meal_type <- str_replace(meals_skinny$meal_type, "pkdl", "Packed Lunch")
meals_skinny$meal_type <- str_replace(meals_skinny$meal_type, "othl", "Other Lunch")  
meals_skinny$meal_type <- str_replace(meals_skinny$meal_type, "mlwhl", "Meals on Wheels")
meals_skinny$meal_type <- str_replace(meals_skinny$meal_type, "midml", "Midday Meal")
meals_skinny$meal_type <- str_replace(meals_skinny$meal_type, "mlso", "Evening Meal")
meals_skinny$meal_type <- factor(meals_skinny$meal_type)
glimpse(meals_skinny)

## Observations: 4,842,408
## Variables: 3
## $ year      <int> 1974, 1974, 1974, 1974, 1974, 1974, 1974, 1974, 1974, 1974,…
## $ meal_type <fct> School Lunch, Packed Lunch, Other Lunch, Meals on Wheels, M…
## $ meals_out <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 2, 2, 1, 0,…
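The six str_replace() calls do the job. A named lookup vector is an alternative that keeps the codes and labels in one place. A sketch on a few made-up codes:

```r
# Code-to-label lookup, using the same labels as above
meal_labels <- c(
  schml = "School Lunch",
  pkdl  = "Packed Lunch",
  othl  = "Other Lunch",
  mlwhl = "Meals on Wheels",
  midml = "Midday Meal",
  mlso  = "Evening Meal"
)

# A few made-up values standing in for the meal_type column
codes <- c("schml", "mlso", "pkdl", "mlso")

# Index the lookup by code, then drop the names it carries along
labelled <- factor(unname(meal_labels[codes]))
labelled
```

Because the lookup matches whole codes rather than substrings, the order of the replacements cannot matter.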

Plot all types of meals across years

meals_skinny %>%
  group_by(year, meal_type) %>%
  summarise(avg_meals_out = mean(meals_out)) %>%
  ggplot(aes(x=year, y=avg_meals_out, color=meal_type)) +
  geom_line()

A little noisy.

Plot them individually

meals_skinny %>%
  group_by(year, meal_type) %>%
  summarise(avg_meals_out = mean(meals_out)) %>%
  ungroup() %>%
  ggplot(aes(x=year, y=avg_meals_out)) +
  geom_line() +
  facet_wrap(~meal_type, nrow = 3, ncol = 2)

Meals on wheels, Other Lunch and Packed Lunch are not trending so we can remove them to highlight the others.

Try stacking the plots to see how it looks

`%notin%` <- Negate(`%in%`)
meals_skinny %>% 
  filter(meal_type %notin% c("Meals on Wheels", "Other Lunch", "Packed Lunch")) %>%
  group_by(year, meal_type) %>%
  summarise(avg_meals_out = mean(meals_out)) %>%
  ungroup() %>%
  ggplot(aes(x=year, y=avg_meals_out)) +
  geom_line() +
  facet_wrap(~meal_type, nrow = 3)

Yuck!!

Try original plot with three types

meals_skinny %>% 
  filter(meal_type %notin% c("Meals on Wheels", "Other Lunch", "Packed Lunch")) %>%
  group_by(year, meal_type) %>%
  summarise(avg_meals_out = mean(meals_out)) %>%
  ungroup() %>%
  ggplot(aes(x=year, y=avg_meals_out, color = meal_type)) +
  geom_line() +
  ggtitle("Sharp drop in meals eaten out since 1995") + 
  labs(x="Year", y="Average meals out") +
  theme_minimal()

Much better.

Act 3 - Finale

BOSS: O’Neil!! Where is my pdf?

Aside: The Boss likes his charts in pdf form for some reason. ggplot has a function for that.

Use `ggsave()` to save the last plot

There are many options but we want a pdf so all I do is specify the path and filename.

ggsave("xyz/MealsOut.pdf")
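ggsave() saves the last plot displayed by default. When the Boss inevitably asks for a specific size, the plot and the dimensions can be passed explicitly. A sketch with made-up numbers, saved to a temporary folder rather than xyz:

```r
library(ggplot2)

# Made-up numbers, only here to have something to draw
demo_data <- data.frame(
  year = 1974:1979,
  avg_meals_out = c(3, 4, 2, 5, 4, 6)
)

demo_plot <- ggplot(demo_data, aes(x = year, y = avg_meals_out)) +
  geom_line()

demo_pdf <- file.path(tempdir(), "MealsOutDemo.pdf")

# Name the plot explicitly and fix the page size in inches
ggsave(demo_pdf, plot = demo_plot, width = 8, height = 5)
file.exists(demo_pdf)
```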

ME: It’s in the xyz folder. Take a look and let me know how you want it dressed up. I’m heading out for a three-martini lunch.

Closing

If this situation hasn’t happened to you yet, it will. In many cases it’s easier to open the zip files and manually select the data files you need, but here we would have had to do that 27 times. The pattern that Hadley Wickham introduces in The Joy of Functional Programming is fast, efficient, and understandable. It also opens up a new way of thinking about working with complex data structures in R.

Let’s say that two years have passed and we get a new zip file. The roughly 30 lines of code below are all that is needed to produce the graph from the original zip file. Our future selves will thank us.

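library(tidyverse)
library(fs)
paths <- dir_ls("xyz/", glob = "*.zip") # skip non-zip files such as the saved MealsOut.pdf
unzip(paths[[1]], overwrite = TRUE, exdir = "nfs_zip_files/")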
paths <- dir_ls("nfs_zip_files/") 
paths <- setdiff(paths, "nfs_zip_files/NFS_ReferenceCodesDescriptors.zip") 
walk(paths, ~ unzip(.x, exdir = "nfs_source_files"))
mealsout_paths <- dir_ls("nfs_source_files", regexp = "mealsout")
meals <- vroom::vroom(mealsout_paths, id = "path") 
meals <- meals %>% extract(path, "year", "(\\d{4})", convert = TRUE)
meals_long <- meals %>%
  select(-hhno, -logday) %>%
  pivot_longer(cols = schml:mlso, names_to = "meal_type",
               values_to = "meals_out")
meals_long$meal_type <- str_replace(meals_long$meal_type, "schml", "School Lunch")
meals_long$meal_type <- str_replace(meals_long$meal_type, "pkdl", "Packed Lunch")
meals_long$meal_type <- str_replace(meals_long$meal_type, "othl", "Other Lunch")
meals_long$meal_type <- str_replace(meals_long$meal_type, "mlwhl", "Meals on Wheels")
meals_long$meal_type <- str_replace(meals_long$meal_type, "midml", "Midday Meal")
meals_long$meal_type <- str_replace(meals_long$meal_type, "mlso", "Evening Meal")
meals_long$meal_type <- factor(meals_long$meal_type)
`%notin%` <- Negate(`%in%`)
meals_long %>%
  filter(meal_type %notin% c("Meals on Wheels", "Other Lunch", "Packed Lunch")) %>%
  group_by(year, meal_type) %>%
  summarise(avg_meals_out = mean(meals_out)) %>%
  ungroup() %>%
  ggplot(aes(x=year, y=avg_meals_out, color = meal_type)) +
  geom_line() +
  ggtitle("Sharp drop in meals eaten out since 1995") +
  labs(x="Year", y="Average meals out") +
  theme_minimal()
ggsave("xyz/MealsOut.pdf")

Bulk processing of zip files

A tragic comedy in three acts

Kier O’Neil

11/29/2019