Data Wrangling Midterm - Project Proposal

Problem Statement
Dataset Importing
Data Dictionary
Data Cleaning
- Data Cleaning Summary
Proposed EDA

Problem Statement

America has a problem with dairy, more specifically the overproduction of dairy products and American’s declining consumption of those products. The lack of dairy consumption in the United States as referenced in pieces by NPR and The Washington Post show that while milk production in the US has increased by 13% over the past 10 years, milk consumption is down nearly 40% over the same time period. In order to deal with this milk surplus, producers have turned to making other dairy products, such as cheese, to reduce spoilage. Less perishable items like cheese, canned and dry milk products, and even ice cream require additional manufacturing, storage, and transportation costs that ultimately come back to the consumer and dairy farmer. As for storage, this is turning into a large problem as the amount of cheese sitting in warehouses has reached a record high (Fig 1). Large and increasing numbers of dairy products sitting in storage combined with the current political climate regarding trade wars and tariffs should be of great concern to farmers and producers because if there is excess inventory, prices will likely fall.

This analysis will examine the many factors that go into dairy production and consumption to see if any insights can be found that may address the growing disparity between dairy production and consumption. My hypothesis, before any kind of data analysis has been performed, is if trends in consumer consumption of certain dairy products can be established over time, then producers can better focus their efforts at maximizing those sectors of the market which will hopefully boost sales and decrease the amount of dairy products that either sit in storage or go to waste.

Fig 1: Cheese stockpile Image Source

Dataset Importing

The data set used in this analysis comes from the USDA and represents many different variables related to dairy production and consumption spanning from 1970-2017. The data set was spawned from a GitHub tidytuesday project, where the goal of #TidyTuesday is to provide a forum for practicing data wrangling skills.

The data is stored in csv (comma-separated values) files and includes:

milkcow_facts.csv

milk_products_facts.csv

fluid_milk_sales.csv

clean_cheese.csv

state_milk_production.csv

As csv files are usually difficult to interpret in their raw form we will be using the readr and tidyverse packages to both import and manipulate our raw data set. The readr package will be used to import the data from GitHub and the tidyverse package contains multiple packages such as dplyr, purrr, and ggplot2 which will be used for the creation of data frames, data cleaning, and visualization.

First, we must install and load the packages using:

#Make sure readr and tidyverse are installed first using:
#install.packages(c("readr","tidyverse"))
library(readr)
library(tidyverse)

Next, we must import the five csv files from their GitHub sources using read_csv from the readr package. Additionally, for ease of data visualization we will be converting each of the five csv files to tibbles using the as.tibble command from the dplyr package that exists in tidyverse. These two steps are accomplished using the following:

#Import csv files from online source

#Import milkcow_facts.csv
url_csv <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-01-29/milkcow_facts.csv"
milkcow_facts <- read_csv(url_csv)
milkcow_facts <- as.tibble(milkcow_facts)

#Import milk_products_facts.csv
url_csv <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-01-29/milk_products_facts.csv"
milk_products_facts <- read_csv(url_csv)
milk_products_facts <- as.tibble(milk_products_facts)

#Import fluid_milk_sales
url_csv <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-01-29/fluid_milk_sales.csv"
fluid_milk_sales <- read_csv(url_csv)
fluid_milk_sales <- as.tibble(fluid_milk_sales)

#Import clean_cheese
url_csv <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-01-29/clean_cheese.csv"
clean_cheese <- read_csv(url_csv)
clean_cheese <- as.tibble(clean_cheese)

#Import state_milk_production
url_csv <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-01-29/state_milk_production.csv"
state_milk_production <- read_csv(url_csv)
state_milk_production <- as.tibble(state_milk_production)

Please be aware that if the file location url changes on GitHub, this will prevent the csv files from being properly imported into R.

Data Dictionary

Now that the csv files have been imported, one should open them up to get a sense of the information that is contained in the raw data. The data dictionary for the five csv files is:

milkcow_facts

variable	class	Description
year	date	Year
avg_milk_cow_number	double	Average number of milk cows
milk_per_cow	double	Average milk production/cow in pounds
milk_production_lbs	double	Total Milk production in pounds
avg_price_milk	double	Average price paid for milk (dollars per pound)
dairy_ration	double	Average price paid for dairy cow rations (dollars per pound)
milk_feed_price_ratio	double	Ratio of average price of milk per dairy cow ration
milk_cow_cost_per_animal	double	Average cost of milk cow per animal (dollars)
milk_volume_to_buy_cow_in_lbs	double	Milk volume required to purchase a cow (pounds)
alfalfa_hay_price	double	Alfalfa hay price received by farmers (tons)
slaughter_cow_price	double	Slaughter cow price (value of meat in dollars per pound)

fluid_milk_sales

variable	class	Description
year	date	Year
milk_type	integer	Category of Milk product
pounds	double	Pounds of milk product per year

milk_products_facts.csv

variable	class	Description
year	date	Year
fluid_milk	double	Average milk consumption in lbs per person
fluid_yogurt	double	Average yogurt consumption in lbs per person
butter	double	Average butter consumption in lbs per person
cheese_american	double	Average American cheese consumption in lbs per person
cheese_other	double	Average other cheese consumption in lbs per person
cheese_cottage	double	Average cottage cheese consumption in lbs per person
evap_cnd_canned_whole_milk	double	Average evaporated and canned whole milk consumption in lbs per person
evap_cnd_bulk_whole_milk	double	Average evaporated and canned bulk whole milk consumption in lbs per person
evap_cnd_bulk_and_can_skim_milk	double	Average evaporated and canned bulk and can skim milk consumption in lbs per person
frozen_ice_cream_regular	double	Average regular frozen ice cream consumption in lbs per person
frozen_ice_cream_reduced_fat	double	Average reduced fat frozen ice cream consumption in lbs per person
frozen_sherbet	double	Average frozen sherbet consumption in lbs per person
frozen_other	double	Average other frozen milk product consumption in lbs per person
dry_whole_milk	double	Average dry whole milk consumption in lbs per person
dry_nonfat_milk	double	Average dry nonfat milk consumption in lbs per person
dry_buttermilk	double	Average dry buttermilk consumption in lbs per person
dry_whey	double	Average dry whey (milk protein) consumption in lbs per person

clean_cheese

variable	class	Description
year	date	Year
Cheddar	double	Cheddar consumption in lbs per person
American Other	double	American Other consumption in lbs per person
Mozzarella	double	Mozzarella consumption in lbs per person
Italian other	double	Italian other consumption in lbs per person
Swiss	double	Swiss consumption in lbs per person
Brick	double	Brick consumption in lbs per person
Muenster	double	Muenster consumption in lbs per person
Cream and Neufchatel	double	Cream and Neufchatel consumption in lbs per person
Blue	double	Blue consumption in lbs per person
Other Dairy Cheese	double	Other Dairy Cheese consumption in lbs per person
Processed Cheese	double	Processed Cheese consumption in lbs per person
Foods and spreads	double	Foods and spreads consumption in lbs per person
Total American Chese	double	Total American Chese consumption in lbs per person
Total Italian Cheese	double	Total Italian Cheese consumption in lbs per person
Total Natural Cheese	double	Total Natural Cheese consumption in lbs per person
Total Processed Cheese Products	double	Total Processed Cheese Products consumption in lbs per person

state_milk_production

variable	class	Description
region	character	Region of the US
state	character	US State
year	date	Year
milk_produced	double	Pounds of Milk Produced

Data Cleaning

To analyze this data, we again will be using the tidyverse package and specifically many functions from the dplyr and tidyr packages within tidyverse.

Now that we have imported our data, converted it to more manageable tibbles, and have established our data dictionary, we should start looking at our data to see how ‘tidy’ it is. Remembering that tidy data has:

One variable per column
One observation per row

For instance, let us look at a tibble from both the clean_cheese and the fluid_milk_sales data:

#Display tibble of clean_cheese
clean_cheese

## # A tibble: 48 x 17
##     Year Cheddar `American Other` Mozzarella `Italian other` Swiss Brick
##    <dbl>   <dbl>            <dbl>      <dbl>           <dbl> <dbl> <dbl>
##  1  1970    5.79             1.2        1.19            0.87  0.88  0.1 
##  2  1971    5.91             1.42       1.38            0.92  0.94  0.11
##  3  1972    6.01             1.67       1.57            1.02  1.06  0.1 
##  4  1973    6.07             1.76       1.76            1.03  1.06  0.11
##  5  1974    6.31             2.16       1.86            1.09  1.18  0.11
##  6  1975    6.04             2.11       2.11            1.12  1.09  0.09
##  7  1976    6.45             2.44       2.31            1.24  1.25  0.09
##  8  1977    6.8              2.42       2.46            1.26  1.21  0.07
##  9  1978    6.94             2.59       2.68            1.37  1.33  0.08
## 10  1979    6.93             2.67       2.8             1.42  1.35  0.06
## # ... with 38 more rows, and 10 more variables: Muenster <dbl>, `Cream and
## #   Neufchatel` <dbl>, Blue <dbl>, `Other Dairy Cheese` <dbl>, `Processed
## #   Cheese` <dbl>, `Foods and spreads` <dbl>, `Total American
## #   Chese` <dbl>, `Total Italian Cheese` <dbl>, `Total Natural
## #   Cheese` <dbl>, `Total Processed Cheese Products` <dbl>

#Display tibble of fluid_milk_sales
fluid_milk_sales

## # A tibble: 387 x 3
##     year milk_type      pounds
##    <dbl> <chr>           <dbl>
##  1  1975 Whole     36188000000
##  2  1976 Whole     35241000000
##  3  1977 Whole     34036000000
##  4  1978 Whole     33235000000
##  5  1979 Whole     32480000000
##  6  1980 Whole     31253000000
##  7  1981 Whole     30397000000
##  8  1982 Whole     29350000000
##  9  1983 Whole     28871000000
## 10  1984 Whole     28204000000
## # ... with 377 more rows

Here we can see that in the clean_cheese data there are 48 rows and 17 columns representing the years 1970-2017 and 16 different cheese products and their consumption in lbs per person respectively. In the fluid_milk_sales there are 387 rows and 3 columns showing milk type sales per year in pounds. On first pass it seems like each set of data is tidy, but the way the data is structured makes it difficult to analyze. It would make more sense to transform the data in the clean_cheese set from a wide format to a long format. This is accomplished using the gather command in the tidyr package. We will also be correcting data types making sure that columns are characters when appropriate. We will keep year as a dbl to facilitate visualization:

clean_cheese

#Gather and mutate clean_cheese into clean_cheese_long
clean_cheese_long <- clean_cheese %>%
  gather(cheese_type, lbs_per_person, -Year) %>% 
#Correct misspelling of 'Total American Chese' to 'Total American Cheese'
  mutate(cheese_type = fct_recode(cheese_type, "Total American Cheese" = "Total American Chese"),
#Convert cheese_type from factor to character
         cheese_type = as.character(cheese_type)) %>% 
#Rename Year to year to fit with the rest of the data
    rename(year = Year)
clean_cheese_long

## # A tibble: 768 x 3
##     year cheese_type lbs_per_person
##    <dbl> <chr>                <dbl>
##  1  1970 Cheddar               5.79
##  2  1971 Cheddar               5.91
##  3  1972 Cheddar               6.01
##  4  1973 Cheddar               6.07
##  5  1974 Cheddar               6.31
##  6  1975 Cheddar               6.04
##  7  1976 Cheddar               6.45
##  8  1977 Cheddar               6.8 
##  9  1978 Cheddar               6.94
## 10  1979 Cheddar               6.93
## # ... with 758 more rows

Additionally, clean_cheese is the only set that has missing values. We can identify the variable that has the missing values using the following code. For now we will leave in observations with missing values and the number of overall missing values is quite small:

#Discover NA values
colSums(is.na(clean_cheese))

##                            Year                         Cheddar 
##                               0                               0 
##                  American Other                      Mozzarella 
##                               0                               0 
##                   Italian other                           Swiss 
##                               0                               0 
##                           Brick                        Muenster 
##                               0                               0 
##            Cream and Neufchatel                            Blue 
##                               0                              12 
##              Other Dairy Cheese                Processed Cheese 
##                               0                               0 
##               Foods and spreads            Total American Chese 
##                               0                               0 
##            Total Italian Cheese            Total Natural Cheese 
##                               0                               0 
## Total Processed Cheese Products 
##                               0

milk_products_facts

While we can also apply the gather command to the milk_products_facts data frame, it will be helpful later on in our analysis to have an additional column that represents the total average dairy product consumption in pounds per person. This total dairy consumption can then be grouped by year into the data frame yearly_dairy_consumption.

#Gather and mutate milk_products_facts into milk_products_facts_long
milk_products_facts <- milk_products_facts %>% 
  mutate(total_dairy = rowSums(.[2:17]))

milk_products_facts_long <- milk_products_facts %>% 
    gather(product, lbs_per_person, -year)
milk_products_facts_long

## # A tibble: 774 x 3
##     year product    lbs_per_person
##    <dbl> <chr>               <dbl>
##  1  1975 fluid_milk            247
##  2  1976 fluid_milk            247
##  3  1977 fluid_milk            244
##  4  1978 fluid_milk            241
##  5  1979 fluid_milk            238
##  6  1980 fluid_milk            234
##  7  1981 fluid_milk            230
##  8  1982 fluid_milk            224
##  9  1983 fluid_milk            223
## 10  1984 fluid_milk            224
## # ... with 764 more rows

#Creation of new data frame yearly_dairy_consumption
yearly_dairy_consumption <- milk_products_facts %>%
  group_by(year) %>%
  summarise(total_consumed = sum(total_dairy))
yearly_dairy_consumption

## # A tibble: 43 x 2
##     year total_consumed
##    <dbl>          <dbl>
##  1  1975           313.
##  2  1976           313.
##  3  1977           310.
##  4  1978           307.
##  5  1979           303.
##  6  1980           299.
##  7  1981           295.
##  8  1982           291.
##  9  1983           292.
## 10  1984           295.
## # ... with 33 more rows

When looking at both the clean_cheese and milk_products_facts sets one may be tempted to join the two sets together. However, at this point in the analysis that may not be the best idea as there seems to be similar but slightly different data in the two sets. For instance in the milk_products_facts set there is a cheese_american product and in the clean_cheese set there are both American Other and Total American Cheese cheese types. It is too early in the analysis to join these two sets and assume they represent the same data.

fluid_milk_sales

As noted above, the fluid_milk_sales set is already in a long format so in sticking with our data convention we will not change it.

#Display tibble of fluid_milk_sales
fluid_milk_sales

## # A tibble: 387 x 3
##     year milk_type      pounds
##    <dbl> <chr>           <dbl>
##  1  1975 Whole     36188000000
##  2  1976 Whole     35241000000
##  3  1977 Whole     34036000000
##  4  1978 Whole     33235000000
##  5  1979 Whole     32480000000
##  6  1980 Whole     31253000000
##  7  1981 Whole     30397000000
##  8  1982 Whole     29350000000
##  9  1983 Whole     28871000000
## 10  1984 Whole     28204000000
## # ... with 377 more rows

state_milk_production

The next set, state_milk_production is also in a long format but the column order is slightly confusing. We will rearrange the column order so year is the first column with the other columns representing region, State, and pounds of milk produced:

#Rearrange and display tibble of state_milk_production
state_milk_production <- state_milk_production %>%
  select(year, everything())
state_milk_production

## # A tibble: 2,400 x 4
##     year region    state         milk_produced
##    <dbl> <chr>     <chr>                 <dbl>
##  1  1970 Northeast Maine             619000000
##  2  1970 Northeast New Hampshire     356000000
##  3  1970 Northeast Vermont          1970000000
##  4  1970 Northeast Massachusetts     658000000
##  5  1970 Northeast Rhode Island       75000000
##  6  1970 Northeast Connecticut       661000000
##  7  1970 Northeast New York        10341000000
##  8  1970 Northeast New Jersey        730000000
##  9  1970 Northeast Pennsylvania     7124000000
## 10  1970 Northeast Delaware          130000000
## # ... with 2,390 more rows

milkcow_facts

The milkcow_facts is an interesting set in that it is the only set that contains variables that have multiple different unit types. Where the other sets had consistent unit types, such as lbs per person, or pounds produced, milkcow_facts has variables where the unit is a number, a pound, a ratio (dollars per pound), etc. With data that has this many unit types it does not make sense to convert it into a long format and instead we will keep it in a wide format.

#Display tibble of milkcow_facts
milkcow_facts

## # A tibble: 35 x 11
##     year avg_milk_cow_numb~ milk_per_cow milk_production_l~ avg_price_milk
##    <dbl>              <dbl>        <dbl>              <dbl>          <dbl>
##  1  1980           10799000        11891       128406000000          0.13 
##  2  1981           10898000        12183       132770000000          0.138
##  3  1982           11011000        12306       135505000000          0.136
##  4  1983           11059000        12622       139588000000          0.136
##  5  1984           10793000        12541       135351000000          0.135
##  6  1985           10981000        13024       143012000000          0.127
##  7  1986           10773000        13285       143124000000          0.125
##  8  1987           10327000        13819       142709000000          0.125
##  9  1988           10224000        14185       145034000000          0.122
## 10  1989           10046000        14323       143893000000          0.136
## # ... with 25 more rows, and 6 more variables: dairy_ration <dbl>,
## #   milk_feed_price_ratio <dbl>, milk_cow_cost_per_animal <dbl>,
## #   milk_volume_to_buy_cow_in_lbs <dbl>, alfalfa_hay_price <dbl>,
## #   slaughter_cow_price <dbl>

Interestingly, state_milk_production has data recorded from 1970 - 2017 compared to milkcow_facts which only has data from 1980 - 2014. Using the following code, we can create a new data farm which represents total milk production from 1970 - 2017 in the United States which will be called yearly_milk_production:

#Group and sum of yearly_milk_production
yearly_milk_production <- state_milk_production %>%
  group_by(year) %>% 
  summarise(total_produced = sum(milk_produced))
yearly_milk_production

## # A tibble: 48 x 2
##     year total_produced
##    <dbl>          <dbl>
##  1  1970   117007000000
##  2  1971   118566000000
##  3  1972   120025000000
##  4  1973   115491000000
##  5  1974   115586000000
##  6  1975   115398000000
##  7  1976   120180000000
##  8  1977   122654000000
##  9  1978   121461000000
## 10  1979   123351000000
## # ... with 38 more rows

Without going too much into the analysis, we can do a quick plot of yearly production which shows what was noted in the problem statement, that dairy production has increased significantly over the last 40 years:

#Plot of yearly_milk_production by year
yearly_milk_production %>% 
  ggplot(aes(year, total_produced)) +
  geom_line() +
  labs(title = "Dairy production in the US",
       x = "Year",
       y = "Milk produced (lbs)")

While dairy product consumption is decreasing:

#Plot of yearly_dairy_consumption by year
yearly_dairy_consumption %>% 
  ggplot(aes(year, total_consumed)) +
  geom_line() +
  labs(title = "Dairy consumption in the US",
       x = "Year",
       y = "Average consumption per person (lbs)")

Data Cleaning Summary

At this point we have seven tidy data frames on which to start our analysis:

clean_cheese_long
milk_products_facts_long
- yearly_dairy_consumption
fluid_milk_sales
state_milk_production
- yearly_milk_production
milkcow_facts

Additionally, it does not make sense to join or combine these different sets into one large data set due to all of the different unit types. As the analysis progresses, we may discover that certain variables do not contain useful information and may subsequently be dropped allowing for the creation of one large data set. However, at the risk of prematurely deleting data, having multiple, smaller, tidy sets appears to be the most efficient method to analyze and visualize the data

Proposed EDA

As we have shown in the two graphs above, with a little manipulation of the data, dairy production in the US has been increasing while dairy consumption has been decreasing. What is interesting about this trend is that it has not happened suddenly, the decline in consumption has been happening since the start of the recording period. One would think that in a market where product consumption has been steadily declining, production would decline as well so that production meets demand.

This disparity between production and consumption will be the target of our future analysis. Questions that we will pursue include:

We have product data for different dairy products, can we slice and dice the data to breakout individual products to uncover consumer trends?
- If consumer tastes are changing over time can we build a regression model or forecasting model to predict future demand?
Within the milkcow_facts and state_milk_production sets, there are multiple variables related to dairy production including production by geographic area, feed prices, milk production per cow, and wholesale value of meat.
- Has efficiency in the milk production process, that is, has the amount of milk produced per cow, outpaced demand?
- Have costs associated with producing milk changed over time allowing farmers to have this extra capacity?
- Has production in certain geographic areas increased disproportionately compared to the rest of the US? Plotting this geographic data in a time series fashion using heat maps would help address this question.
As mentioned above, while it did not make sense to join the different data frames into one large set, there do seem to be some commonalities between some of the sets
- In the milk_products_facts set there is a cheese_american product and in the clean_cheese set there are both American Other and Total American Cheese cheese types, is this a potential target for a join? Do other commonalities exist between data frames?
Are there external datasets that can be used to determine if the excess production is going somewhere else? Overseas? Other uses?

While we can answer many of the above questions with simple graphs such as scatter plots and line graphs, the great thing about time series data is that it provides a great opportunity for linear model building and forecasting. While I am quite familiar with building linear regression models in R, I have not done any forecasting and will need to do some outside reading on how to build a forecasting model with the data frames. On first glance, the clean_cheese_long and milk_products_facts_long data frames look to be the most appropriate for model forecasting to try to identify customer trends.

This is a very interesting data set to work with for a few reasons. First, it is very complete, there are lots of variables that have been continuously recorded from 1970 - 2017 with only a few missing values. Second, the data itself started out relatively clean and though a little bit of data manipulation we now have seven tidy data frames to work with, data was not scattered or linked across multiple locations and each data frame is consistent in its language and the variables presented make sense. Finally, and most importantly, the data set will be used to address a real world problem, that will hopefully provide some insight into how dairy farmers and producers can adapt to changing consumer preferences.

Data Wrangling Midterm - Project Proposal

US Dairy Analysis

Robert Doerning

2019-04-06

Problem Statement

Dataset Importing

Data Dictionary

milkcow_facts

fluid_milk_sales

milk_products_facts.csv

clean_cheese

state_milk_production

Data Cleaning

clean_cheese

milk_products_facts

fluid_milk_sales

state_milk_production

milkcow_facts

Data Cleaning Summary

Proposed EDA