America has a problem with dairy, more specifically the overproduction of dairy products and American’s declining consumption of those products. The lack of dairy consumption in the United States as referenced in pieces by NPR and The Washington Post show that while milk production in the US has increased by 13% over the past 10 years, milk consumption is down nearly 40% over the same time period. In order to deal with this milk surplus, producers have turned to making other dairy products, such as cheese, to reduce spoilage. Less perishable items like cheese, canned and dry milk products, and even ice cream require additional manufacturing, storage, and transportation costs that ultimately come back to the consumer and dairy farmer. As for storage, this is turning into a large problem as the amount of cheese sitting in warehouses has reached a record high (Fig 1). Large and increasing numbers of dairy products sitting in storage combined with the current political climate regarding trade wars and tariffs should be of great concern to farmers and producers because if there is excess inventory, prices will likely fall.
This analysis will examine the many factors that go into dairy production and consumption to see if any insights can be found that may address the growing disparity between dairy production and consumption. My hypothesis, before any kind of data analysis has been performed, is if trends in consumer consumption of certain dairy products can be established over time, then producers can better focus their efforts at maximizing those sectors of the market which will hopefully boost sales and decrease the amount of dairy products that either sit in storage or go to waste.
The data set used in this analysis comes from the USDA and represents many different variables related to dairy production and consumption spanning from 1970-2017. The data set was spawned from a GitHub tidytuesday project, where the goal of #TidyTuesday is to provide a forum for practicing data wrangling skills.
The data is stored in csv (comma-separated values) files and includes:
As csv files are usually difficult to interpret in their raw form we will be using the readr and tidyverse packages to both import and manipulate our raw data set. The readr package will be used to import the data from GitHub and the tidyverse package contains multiple packages such as dplyr, purrr, and ggplot2 which will be used for the creation of data frames, data cleaning, and visualization.
First, we must install and load the packages using:
#Make sure readr and tidyverse are installed first using:
#install.packages(c("readr","tidyverse"))
library(readr)
library(tidyverse)
Next, we must import the five csv files from their GitHub sources using read_csv from the readr package. Additionally, for ease of data visualization we will be converting each of the five csv files to tibbles using the as.tibble command from the dplyr package that exists in tidyverse. These two steps are accomplished using the following:
#Import csv files from online source
#Import milkcow_facts.csv
url_csv <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-01-29/milkcow_facts.csv"
milkcow_facts <- read_csv(url_csv)
milkcow_facts <- as.tibble(milkcow_facts)
#Import milk_products_facts.csv
url_csv <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-01-29/milk_products_facts.csv"
milk_products_facts <- read_csv(url_csv)
milk_products_facts <- as.tibble(milk_products_facts)
#Import fluid_milk_sales
url_csv <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-01-29/fluid_milk_sales.csv"
fluid_milk_sales <- read_csv(url_csv)
fluid_milk_sales <- as.tibble(fluid_milk_sales)
#Import clean_cheese
url_csv <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-01-29/clean_cheese.csv"
clean_cheese <- read_csv(url_csv)
clean_cheese <- as.tibble(clean_cheese)
#Import state_milk_production
url_csv <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-01-29/state_milk_production.csv"
state_milk_production <- read_csv(url_csv)
state_milk_production <- as.tibble(state_milk_production)
Please be aware that if the file location url changes on GitHub, this will prevent the csv files from being properly imported into R.
Now that the csv files have been imported, one should open them up to get a sense of the information that is contained in the raw data. The data dictionary for the five csv files is:
| variable | class | Description |
|---|---|---|
| year | date | Year |
| avg_milk_cow_number | double | Average number of milk cows |
| milk_per_cow | double | Average milk production/cow in pounds |
| milk_production_lbs | double | Total Milk production in pounds |
| avg_price_milk | double | Average price paid for milk (dollars per pound) |
| dairy_ration | double | Average price paid for dairy cow rations (dollars per pound) |
| milk_feed_price_ratio | double | Ratio of average price of milk per dairy cow ration |
| milk_cow_cost_per_animal | double | Average cost of milk cow per animal (dollars) |
| milk_volume_to_buy_cow_in_lbs | double | Milk volume required to purchase a cow (pounds) |
| alfalfa_hay_price | double | Alfalfa hay price received by farmers (tons) |
| slaughter_cow_price | double | Slaughter cow price (value of meat in dollars per pound) |
| variable | class | Description |
|---|---|---|
| year | date | Year |
| milk_type | integer | Category of Milk product |
| pounds | double | Pounds of milk product per year |
| variable | class | Description |
|---|---|---|
| year | date | Year |
| fluid_milk | double | Average milk consumption in lbs per person |
| fluid_yogurt | double | Average yogurt consumption in lbs per person |
| butter | double | Average butter consumption in lbs per person |
| cheese_american | double | Average American cheese consumption in lbs per person |
| cheese_other | double | Average other cheese consumption in lbs per person |
| cheese_cottage | double | Average cottage cheese consumption in lbs per person |
| evap_cnd_canned_whole_milk | double | Average evaporated and canned whole milk consumption in lbs per person |
| evap_cnd_bulk_whole_milk | double | Average evaporated and canned bulk whole milk consumption in lbs per person |
| evap_cnd_bulk_and_can_skim_milk | double | Average evaporated and canned bulk and can skim milk consumption in lbs per person |
| frozen_ice_cream_regular | double | Average regular frozen ice cream consumption in lbs per person |
| frozen_ice_cream_reduced_fat | double | Average reduced fat frozen ice cream consumption in lbs per person |
| frozen_sherbet | double | Average frozen sherbet consumption in lbs per person |
| frozen_other | double | Average other frozen milk product consumption in lbs per person |
| dry_whole_milk | double | Average dry whole milk consumption in lbs per person |
| dry_nonfat_milk | double | Average dry nonfat milk consumption in lbs per person |
| dry_buttermilk | double | Average dry buttermilk consumption in lbs per person |
| dry_whey | double | Average dry whey (milk protein) consumption in lbs per person |
| variable | class | Description |
|---|---|---|
| year | date | Year |
| Cheddar | double | Cheddar consumption in lbs per person |
| American Other | double | American Other consumption in lbs per person |
| Mozzarella | double | Mozzarella consumption in lbs per person |
| Italian other | double | Italian other consumption in lbs per person |
| Swiss | double | Swiss consumption in lbs per person |
| Brick | double | Brick consumption in lbs per person |
| Muenster | double | Muenster consumption in lbs per person |
| Cream and Neufchatel | double | Cream and Neufchatel consumption in lbs per person |
| Blue | double | Blue consumption in lbs per person |
| Other Dairy Cheese | double | Other Dairy Cheese consumption in lbs per person |
| Processed Cheese | double | Processed Cheese consumption in lbs per person |
| Foods and spreads | double | Foods and spreads consumption in lbs per person |
| Total American Chese | double | Total American Chese consumption in lbs per person |
| Total Italian Cheese | double | Total Italian Cheese consumption in lbs per person |
| Total Natural Cheese | double | Total Natural Cheese consumption in lbs per person |
| Total Processed Cheese Products | double | Total Processed Cheese Products consumption in lbs per person |
| variable | class | Description |
|---|---|---|
| region | character | Region of the US |
| state | character | US State |
| year | date | Year |
| milk_produced | double | Pounds of Milk Produced |
To analyze this data, we again will be using the tidyverse package and specifically many functions from the dplyr and tidyr packages within tidyverse.
Now that we have imported our data, converted it to more manageable tibbles, and have established our data dictionary, we should start looking at our data to see how ‘tidy’ it is. Remembering that tidy data has:
For instance, let us look at a tibble from both the clean_cheese and the fluid_milk_sales data:
#Display tibble of clean_cheese
clean_cheese
## # A tibble: 48 x 17
## Year Cheddar `American Other` Mozzarella `Italian other` Swiss Brick
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1970 5.79 1.2 1.19 0.87 0.88 0.1
## 2 1971 5.91 1.42 1.38 0.92 0.94 0.11
## 3 1972 6.01 1.67 1.57 1.02 1.06 0.1
## 4 1973 6.07 1.76 1.76 1.03 1.06 0.11
## 5 1974 6.31 2.16 1.86 1.09 1.18 0.11
## 6 1975 6.04 2.11 2.11 1.12 1.09 0.09
## 7 1976 6.45 2.44 2.31 1.24 1.25 0.09
## 8 1977 6.8 2.42 2.46 1.26 1.21 0.07
## 9 1978 6.94 2.59 2.68 1.37 1.33 0.08
## 10 1979 6.93 2.67 2.8 1.42 1.35 0.06
## # ... with 38 more rows, and 10 more variables: Muenster <dbl>, `Cream and
## # Neufchatel` <dbl>, Blue <dbl>, `Other Dairy Cheese` <dbl>, `Processed
## # Cheese` <dbl>, `Foods and spreads` <dbl>, `Total American
## # Chese` <dbl>, `Total Italian Cheese` <dbl>, `Total Natural
## # Cheese` <dbl>, `Total Processed Cheese Products` <dbl>
#Display tibble of fluid_milk_sales
fluid_milk_sales
## # A tibble: 387 x 3
## year milk_type pounds
## <dbl> <chr> <dbl>
## 1 1975 Whole 36188000000
## 2 1976 Whole 35241000000
## 3 1977 Whole 34036000000
## 4 1978 Whole 33235000000
## 5 1979 Whole 32480000000
## 6 1980 Whole 31253000000
## 7 1981 Whole 30397000000
## 8 1982 Whole 29350000000
## 9 1983 Whole 28871000000
## 10 1984 Whole 28204000000
## # ... with 377 more rows
Here we can see that in the clean_cheese data there are 48 rows and 17 columns representing the years 1970-2017 and 16 different cheese products and their consumption in lbs per person respectively. In the fluid_milk_sales there are 387 rows and 3 columns showing milk type sales per year in pounds. On first pass it seems like each set of data is tidy, but the way the data is structured makes it difficult to analyze. It would make more sense to transform the data in the clean_cheese set from a wide format to a long format. This is accomplished using the gather command in the tidyr package. We will also be correcting data types making sure that columns are characters when appropriate. We will keep year as a dbl to facilitate visualization:
#Gather and mutate clean_cheese into clean_cheese_long
clean_cheese_long <- clean_cheese %>%
gather(cheese_type, lbs_per_person, -Year) %>%
#Correct misspelling of 'Total American Chese' to 'Total American Cheese'
mutate(cheese_type = fct_recode(cheese_type, "Total American Cheese" = "Total American Chese"),
#Convert cheese_type from factor to character
cheese_type = as.character(cheese_type)) %>%
#Rename Year to year to fit with the rest of the data
rename(year = Year)
clean_cheese_long
## # A tibble: 768 x 3
## year cheese_type lbs_per_person
## <dbl> <chr> <dbl>
## 1 1970 Cheddar 5.79
## 2 1971 Cheddar 5.91
## 3 1972 Cheddar 6.01
## 4 1973 Cheddar 6.07
## 5 1974 Cheddar 6.31
## 6 1975 Cheddar 6.04
## 7 1976 Cheddar 6.45
## 8 1977 Cheddar 6.8
## 9 1978 Cheddar 6.94
## 10 1979 Cheddar 6.93
## # ... with 758 more rows
Additionally, clean_cheese is the only set that has missing values. We can identify the variable that has the missing values using the following code. For now we will leave in observations with missing values and the number of overall missing values is quite small:
#Discover NA values
colSums(is.na(clean_cheese))
## Year Cheddar
## 0 0
## American Other Mozzarella
## 0 0
## Italian other Swiss
## 0 0
## Brick Muenster
## 0 0
## Cream and Neufchatel Blue
## 0 12
## Other Dairy Cheese Processed Cheese
## 0 0
## Foods and spreads Total American Chese
## 0 0
## Total Italian Cheese Total Natural Cheese
## 0 0
## Total Processed Cheese Products
## 0
While we can also apply the gather command to the milk_products_facts data frame, it will be helpful later on in our analysis to have an additional column that represents the total average dairy product consumption in pounds per person. This total dairy consumption can then be grouped by year into the data frame yearly_dairy_consumption.
#Gather and mutate milk_products_facts into milk_products_facts_long
milk_products_facts <- milk_products_facts %>%
mutate(total_dairy = rowSums(.[2:17]))
milk_products_facts_long <- milk_products_facts %>%
gather(product, lbs_per_person, -year)
milk_products_facts_long
## # A tibble: 774 x 3
## year product lbs_per_person
## <dbl> <chr> <dbl>
## 1 1975 fluid_milk 247
## 2 1976 fluid_milk 247
## 3 1977 fluid_milk 244
## 4 1978 fluid_milk 241
## 5 1979 fluid_milk 238
## 6 1980 fluid_milk 234
## 7 1981 fluid_milk 230
## 8 1982 fluid_milk 224
## 9 1983 fluid_milk 223
## 10 1984 fluid_milk 224
## # ... with 764 more rows
#Creation of new data frame yearly_dairy_consumption
yearly_dairy_consumption <- milk_products_facts %>%
group_by(year) %>%
summarise(total_consumed = sum(total_dairy))
yearly_dairy_consumption
## # A tibble: 43 x 2
## year total_consumed
## <dbl> <dbl>
## 1 1975 313.
## 2 1976 313.
## 3 1977 310.
## 4 1978 307.
## 5 1979 303.
## 6 1980 299.
## 7 1981 295.
## 8 1982 291.
## 9 1983 292.
## 10 1984 295.
## # ... with 33 more rows
When looking at both the clean_cheese and milk_products_facts sets one may be tempted to join the two sets together. However, at this point in the analysis that may not be the best idea as there seems to be similar but slightly different data in the two sets. For instance in the milk_products_facts set there is a cheese_american product and in the clean_cheese set there are both American Other and Total American Cheese cheese types. It is too early in the analysis to join these two sets and assume they represent the same data.
As noted above, the fluid_milk_sales set is already in a long format so in sticking with our data convention we will not change it.
#Display tibble of fluid_milk_sales
fluid_milk_sales
## # A tibble: 387 x 3
## year milk_type pounds
## <dbl> <chr> <dbl>
## 1 1975 Whole 36188000000
## 2 1976 Whole 35241000000
## 3 1977 Whole 34036000000
## 4 1978 Whole 33235000000
## 5 1979 Whole 32480000000
## 6 1980 Whole 31253000000
## 7 1981 Whole 30397000000
## 8 1982 Whole 29350000000
## 9 1983 Whole 28871000000
## 10 1984 Whole 28204000000
## # ... with 377 more rows
The next set, state_milk_production is also in a long format but the column order is slightly confusing. We will rearrange the column order so year is the first column with the other columns representing region, State, and pounds of milk produced:
#Rearrange and display tibble of state_milk_production
state_milk_production <- state_milk_production %>%
select(year, everything())
state_milk_production
## # A tibble: 2,400 x 4
## year region state milk_produced
## <dbl> <chr> <chr> <dbl>
## 1 1970 Northeast Maine 619000000
## 2 1970 Northeast New Hampshire 356000000
## 3 1970 Northeast Vermont 1970000000
## 4 1970 Northeast Massachusetts 658000000
## 5 1970 Northeast Rhode Island 75000000
## 6 1970 Northeast Connecticut 661000000
## 7 1970 Northeast New York 10341000000
## 8 1970 Northeast New Jersey 730000000
## 9 1970 Northeast Pennsylvania 7124000000
## 10 1970 Northeast Delaware 130000000
## # ... with 2,390 more rows
The milkcow_facts is an interesting set in that it is the only set that contains variables that have multiple different unit types. Where the other sets had consistent unit types, such as lbs per person, or pounds produced, milkcow_facts has variables where the unit is a number, a pound, a ratio (dollars per pound), etc. With data that has this many unit types it does not make sense to convert it into a long format and instead we will keep it in a wide format.
#Display tibble of milkcow_facts
milkcow_facts
## # A tibble: 35 x 11
## year avg_milk_cow_numb~ milk_per_cow milk_production_l~ avg_price_milk
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1980 10799000 11891 128406000000 0.13
## 2 1981 10898000 12183 132770000000 0.138
## 3 1982 11011000 12306 135505000000 0.136
## 4 1983 11059000 12622 139588000000 0.136
## 5 1984 10793000 12541 135351000000 0.135
## 6 1985 10981000 13024 143012000000 0.127
## 7 1986 10773000 13285 143124000000 0.125
## 8 1987 10327000 13819 142709000000 0.125
## 9 1988 10224000 14185 145034000000 0.122
## 10 1989 10046000 14323 143893000000 0.136
## # ... with 25 more rows, and 6 more variables: dairy_ration <dbl>,
## # milk_feed_price_ratio <dbl>, milk_cow_cost_per_animal <dbl>,
## # milk_volume_to_buy_cow_in_lbs <dbl>, alfalfa_hay_price <dbl>,
## # slaughter_cow_price <dbl>
Interestingly, state_milk_production has data recorded from 1970 - 2017 compared to milkcow_facts which only has data from 1980 - 2014. Using the following code, we can create a new data farm which represents total milk production from 1970 - 2017 in the United States which will be called yearly_milk_production:
#Group and sum of yearly_milk_production
yearly_milk_production <- state_milk_production %>%
group_by(year) %>%
summarise(total_produced = sum(milk_produced))
yearly_milk_production
## # A tibble: 48 x 2
## year total_produced
## <dbl> <dbl>
## 1 1970 117007000000
## 2 1971 118566000000
## 3 1972 120025000000
## 4 1973 115491000000
## 5 1974 115586000000
## 6 1975 115398000000
## 7 1976 120180000000
## 8 1977 122654000000
## 9 1978 121461000000
## 10 1979 123351000000
## # ... with 38 more rows
Without going too much into the analysis, we can do a quick plot of yearly production which shows what was noted in the problem statement, that dairy production has increased significantly over the last 40 years:
#Plot of yearly_milk_production by year
yearly_milk_production %>%
ggplot(aes(year, total_produced)) +
geom_line() +
labs(title = "Dairy production in the US",
x = "Year",
y = "Milk produced (lbs)")
While dairy product consumption is decreasing:
#Plot of yearly_dairy_consumption by year
yearly_dairy_consumption %>%
ggplot(aes(year, total_consumed)) +
geom_line() +
labs(title = "Dairy consumption in the US",
x = "Year",
y = "Average consumption per person (lbs)")
At this point we have seven tidy data frames on which to start our analysis:
Additionally, it does not make sense to join or combine these different sets into one large data set due to all of the different unit types. As the analysis progresses, we may discover that certain variables do not contain useful information and may subsequently be dropped allowing for the creation of one large data set. However, at the risk of prematurely deleting data, having multiple, smaller, tidy sets appears to be the most efficient method to analyze and visualize the data
As we have shown in the two graphs above, with a little manipulation of the data, dairy production in the US has been increasing while dairy consumption has been decreasing. What is interesting about this trend is that it has not happened suddenly, the decline in consumption has been happening since the start of the recording period. One would think that in a market where product consumption has been steadily declining, production would decline as well so that production meets demand.
This disparity between production and consumption will be the target of our future analysis. Questions that we will pursue include:
milkcow_facts and state_milk_production sets, there are multiple variables related to dairy production including production by geographic area, feed prices, milk production per cow, and wholesale value of meat.
milk_products_facts set there is a cheese_american product and in the clean_cheese set there are both American Other and Total American Cheese cheese types, is this a potential target for a join? Do other commonalities exist between data frames?While we can answer many of the above questions with simple graphs such as scatter plots and line graphs, the great thing about time series data is that it provides a great opportunity for linear model building and forecasting. While I am quite familiar with building linear regression models in R, I have not done any forecasting and will need to do some outside reading on how to build a forecasting model with the data frames. On first glance, the clean_cheese_long and milk_products_facts_long data frames look to be the most appropriate for model forecasting to try to identify customer trends.
This is a very interesting data set to work with for a few reasons. First, it is very complete, there are lots of variables that have been continuously recorded from 1970 - 2017 with only a few missing values. Second, the data itself started out relatively clean and though a little bit of data manipulation we now have seven tidy data frames to work with, data was not scattered or linked across multiple locations and each data frame is consistent in its language and the variables presented make sense. Finally, and most importantly, the data set will be used to address a real world problem, that will hopefully provide some insight into how dairy farmers and producers can adapt to changing consumer preferences.