This is a R Markdown document for milk product analysis.

Please follow the each tab in order.

Introduction

Though Americans comsume a good amount of milk products, there is a stockpile of cheese that is massive enough to wrap around the U.S. Capitol.

This happened because farmers have overproduced milk thanks to its high price and have failed to anticipate that American consume less milk than before.

So the farmers turned their extra milks into cheese because it is less perishable and stay fresh for longer period.

I decided to investigate the milk and cheese consumption data to get insights about America’s milk product consumption by providing data visualization and by doing so, to allow users to understand what is the best way to handle the glut of milk.

Packages used

library(readr)   # used to import csv files

library(dplyr)   # used to manipulate data

library(tidyr)   # used to tidy data

library(ggplot2) # used to creat data visualization

Data manipulation

I acquired the data from Tidy Tuesday Dariy Data.

However, they are providing the tidied data that are available from USDA Dairy Data.

Also, information about how the data is assembled can be found from here.


Data importation

clean_cheese <- read_csv("data/clean_cheese.csv")

fluid_milk_sales <- read_csv("data/fluid_milk_sales.csv")

pop <- read_csv("data/pop.csv")

Let’s start by importing the data. The first data is about the cheese consumption per capita and the second data is about the beverage milk consumption.

I am going to compare the milk consumption per capita with the beverage milk consumption per capita. To have a fair comparison, I needed the same scale. I looked for how the data of cheese consumption per capita is measured and used same calculation to get beverage milk consumption per capita.

For this reason, the third data about the resident population plus the Armed Forces overseas is collected from here.

ERS provides annual per capita consumption estimates for major dairy products. For most products, per capita consumption is calculated by dividing domestic disappearance by the U.S. resident population plus armed forces overseas.


Learn about the data

sapply(mget(ls()), dim)
##      clean_cheese fluid_milk_sales pop
## [1,]           48              387  43
## [2,]           17                3   2
sapply(mget(ls()), summary)
## $clean_cheese
##       Year         Cheddar       American Other    Mozzarella    
##  Min.   :1970   Min.   : 5.790   Min.   :1.200   Min.   : 1.190  
##  1st Qu.:1982   1st Qu.: 8.297   1st Qu.:2.220   1st Qu.: 3.183  
##  Median :1994   Median : 9.515   Median :2.585   Median : 7.695  
##  Mean   :1994   Mean   : 8.889   Mean   :2.618   Mean   : 6.861  
##  3rd Qu.:2005   3rd Qu.: 9.920   3rd Qu.:2.965   3rd Qu.: 9.967  
##  Max.   :2017   Max.   :11.070   Max.   :3.990   Max.   :11.730  
##                                                                  
##  Italian other       Swiss           Brick            Muenster     
##  Min.   :0.870   Min.   :0.880   Min.   :0.01000   Min.   :0.1700  
##  1st Qu.:1.522   1st Qu.:1.067   1st Qu.:0.03000   1st Qu.:0.2775  
##  Median :2.185   Median :1.170   Median :0.05000   Median :0.3350  
##  Mean   :2.153   Mean   :1.155   Mean   :0.05396   Mean   :0.3402  
##  3rd Qu.:2.752   3rd Qu.:1.240   3rd Qu.:0.07250   3rd Qu.:0.4025  
##  Max.   :3.490   Max.   :1.350   Max.   :0.12000   Max.   :0.5300  
##                                                                    
##  Cream and Neufchatel      Blue        Other Dairy Cheese Processed Cheese
##  Min.   :0.610        Min.   :0.1500   Min.   :0.410      Min.   :3.310   
##  1st Qu.:1.100        1st Qu.:0.1600   1st Qu.:0.750      1st Qu.:3.817   
##  Median :2.050        Median :0.1700   Median :0.965      Median :4.415   
##  Mean   :1.744        Mean   :0.1981   Mean   :1.018      Mean   :4.329   
##  3rd Qu.:2.350        3rd Qu.:0.1825   3rd Qu.:1.295      3rd Qu.:4.805   
##  Max.   :2.640        Max.   :0.3200   Max.   :1.590      Max.   :5.440   
##                       NA's   :12                                          
##  Foods and spreads Total American Chese Total Italian Cheese
##  Min.   :1.910     Min.   : 7.00        Min.   : 2.050      
##  1st Qu.:2.987     1st Qu.:10.81        1st Qu.: 4.685      
##  Median :3.220     Median :11.83        Median : 9.945      
##  Mean   :3.164     Mean   :11.51        Mean   : 9.012      
##  3rd Qu.:3.440     3rd Qu.:12.84        3rd Qu.:12.727      
##  Max.   :3.980     Max.   :15.06        Max.   :15.210      
##                                                             
##  Total Natural Cheese Total Processed Cheese Products
##  Min.   :11.37        Min.   :5.530                  
##  1st Qu.:19.47        1st Qu.:6.897                  
##  Median :26.29        Median :7.595                  
##  Mean   :25.35        Mean   :7.492                  
##  3rd Qu.:31.78        3rd Qu.:8.195                  
##  Max.   :37.23        Max.   :8.750                  
##                                                      
## 
## $fluid_milk_sales
##       year       milk_type             pounds         
##  Min.   :1975   Length:387         Min.   :7.600e+07  
##  1st Qu.:1985   Class :character   1st Qu.:8.365e+08  
##  Median :1996   Mode  :character   Median :3.920e+09  
##  Mean   :1996                      Mean   :1.196e+10  
##  3rd Qu.:2007                      3rd Qu.:1.748e+10  
##  Max.   :2017                      Max.   :5.553e+10  
## 
## $pop
##       year        population   
##  Min.   :1975   Min.   :216.0  
##  1st Qu.:1986   1st Qu.:239.6  
##  Median :1996   Median :269.7  
##  Mean   :1996   Mean   :270.3  
##  3rd Qu.:2006   3rd Qu.:300.3  
##  Max.   :2017   Max.   :328.0
sapply(mget(ls()), head)
## $clean_cheese
## # A tibble: 6 x 17
##    Year Cheddar `American Other` Mozzarella `Italian other` Swiss Brick
##   <dbl>   <dbl>            <dbl>      <dbl>           <dbl> <dbl> <dbl>
## 1  1970    5.79             1.2        1.19            0.87  0.88  0.1 
## 2  1971    5.91             1.42       1.38            0.92  0.94  0.11
## 3  1972    6.01             1.67       1.57            1.02  1.06  0.1 
## 4  1973    6.07             1.76       1.76            1.03  1.06  0.11
## 5  1974    6.31             2.16       1.86            1.09  1.18  0.11
## 6  1975    6.04             2.11       2.11            1.12  1.09  0.09
## # ... with 10 more variables: Muenster <dbl>, `Cream and
## #   Neufchatel` <dbl>, Blue <dbl>, `Other Dairy Cheese` <dbl>, `Processed
## #   Cheese` <dbl>, `Foods and spreads` <dbl>, `Total American
## #   Chese` <dbl>, `Total Italian Cheese` <dbl>, `Total Natural
## #   Cheese` <dbl>, `Total Processed Cheese Products` <dbl>
## 
## $fluid_milk_sales
## # A tibble: 6 x 3
##    year milk_type      pounds
##   <dbl> <chr>           <dbl>
## 1  1975 Whole     36188000000
## 2  1976 Whole     35241000000
## 3  1977 Whole     34036000000
## 4  1978 Whole     33235000000
## 5  1979 Whole     32480000000
## 6  1980 Whole     31253000000
## 
## $pop
## # A tibble: 6 x 2
##    year population
##   <dbl>      <dbl>
## 1  1975       216.
## 2  1976       218.
## 3  1977       220.
## 4  1978       223.
## 5  1979       225.
## 6  1980       228.
sapply(mget(ls()), tail)
## $clean_cheese
## # A tibble: 6 x 17
##    Year Cheddar `American Other` Mozzarella `Italian other` Swiss Brick
##   <dbl>   <dbl>            <dbl>      <dbl>           <dbl> <dbl> <dbl>
## 1  2012    9.59             3.66       10.7            3.11  1.09  0.04
## 2  2013    9.64             3.71       10.7            3.1   1     0.03
## 3  2014    9.85             3.8        11.2            3.02  1.02  0.01
## 4  2015   10.2              3.86       11.3            3.2   1.05  0.01
## 5  2016   10.4              3.95       11.7            3.49  1.06  0.01
## 6  2017   11.1              3.99       11.6            3.49  1.05  0.01
## # ... with 10 more variables: Muenster <dbl>, `Cream and
## #   Neufchatel` <dbl>, Blue <dbl>, `Other Dairy Cheese` <dbl>, `Processed
## #   Cheese` <dbl>, `Foods and spreads` <dbl>, `Total American
## #   Chese` <dbl>, `Total Italian Cheese` <dbl>, `Total Natural
## #   Cheese` <dbl>, `Total Processed Cheese Products` <dbl>
## 
## $fluid_milk_sales
## # A tibble: 6 x 3
##    year milk_type              pounds
##   <dbl> <chr>                   <dbl>
## 1  2012 Total Production 53442153871.
## 2  2013 Total Production 52285761122.
## 3  2014 Total Production 50773151199.
## 4  2015 Total Production 50037811066.
## 5  2016 Total Production 49699917113.
## 6  2017 Total Production 48628692047.
## 
## $pop
## # A tibble: 6 x 2
##    year population
##   <dbl>      <dbl>
## 1  2012       314.
## 2  2013       317.
## 3  2014       319.
## 4  2015       323.
## 5  2016       325.
## 6  2017       328.

One important point I could noticed was that Blue cheese values from year 1998 to 2009 are missing.

I searched for the reason why the values are missing and found this foot note from the original data:

NA= not available; USDA, National Agricultural Statistics Service did not report blue cheese production from 1998 through 2009 to avoid disclosing data from individual operations.

So I decided to leave the missing values for now because I felt that they explain what they exactly mean.


Clean & tidying cheese data

clean_cheese <- clean_cheese[, c(1:10)]

names(clean_cheese) <- c("year", "cheddar", "american_other", "mozzarella",
                         "italian_other", "swiss", "brick", "muenster",
                         "cream_and_neufchatel", "blue")

clean_cheese <- 
  clean_cheese %>%
    gather("cheese_type", "pounds", -1)

american <- c("cheddar", "american_other")

italian <- c("mozzarella", "italian_other")

clean_cheese$category <- 
  case_when(
    clean_cheese$cheese_type %in% american ~ "american",
    clean_cheese$cheese_type %in% italian ~ "italian",
    TRUE ~ "other"
  )

clean_cheese <- clean_cheese[,c(1, 4, 2, 3)]

clean_cheese$category <- as.factor(clean_cheese$category)

clean_cheese$cheese_type <- as.factor(clean_cheese$cheese_type)

To clean the cheese data to perform my analysis, I needed to leave only the cheese from cows. I removed unnecessary columns with the first code.

Second code is written just to have the column name format that I prefer.

Third code is written to tidy the data.

Then, I wanted to have another column that categorizes the type of cheese. other type means other types of cheeses than american and italian typically from cows.

I reordered the column and made the category and cheese_type column into factor data type.


Clean & tidying beverage milk data

fluid_milk_sales <- fluid_milk_sales %>%
  spread(milk_type, pounds)

fluid_milk_sales <- fluid_milk_sales[, c(1:8, 10)]

names(fluid_milk_sales) <- c("year", "buttermilk", "eggnog", 
                         "flavored_not_whole", "flavored_whole", 
                         "low_fat_1%", "reduced_fat_2%", "skim",
                         "whole")

fluid_milk_sales <- fluid_milk_sales %>%
  gather("milk_type", "pounds", -1)

pop$population <- pop$population * 1000000

fluid_milk_sales <- 
  fluid_milk_sales %>%
    left_join(pop, by = "year") %>%
    mutate(
      pounds = pounds / population
    ) %>%
    select(
      year,
      milk_type,
      pounds
    )

fluid_milk_sales$milk_type <- as.factor(fluid_milk_sales$milk_type)

I did pretty much the same thing like I did for cheese data to beverage milk data.

One thing to notice from here is that, as I mentioned before, I made the same scale for the two main dataset for fair comparison.

The population data I got was in a million to have actual value, I multiplied 1,000,000 to that vector.

Then I coded to have a pounds value that represents beverage milk consumption per capital in pounds.


Data visualization

Cheese consumption

clean_cheese %>%
  ggplot(aes(x = year, y = pounds, col = cheese_type)) +
  geom_line() +
  theme_minimal()

This line graph shows the consumption of each cheese type by year.

The most noticable point is the dramatic increase of consumption of mozzarella cheese.


clean_cheese %>%
  group_by(year, category) %>%
  summarise(pounds = sum(pounds, na.rm = TRUE)) %>%
  ggplot(aes(x = year, y = pounds, col = category)) +
  geom_line() +
  theme_minimal()

This second line graph shows the consumption of each cheese category by year.

The consumption for all of three categories are increasing.


Beverage milk consumption

fluid_milk_sales %>%
  ggplot(aes(x = year, y = pounds, col = milk_type)) +
  geom_line() +
  theme_minimal()

This line graph shows the consumption of each beverage milk products by year.

There is a dramatic decrease in consumption of whole milk.


Cheese & beverage milk consumption

cheese_year <- 
  clean_cheese %>%
    group_by(year) %>%
    summarise(cheese = sum(pounds, na.rm = TRUE))

milk_year <-
  fluid_milk_sales %>%
    group_by(year) %>%
    summarise(beverage_milk = sum(pounds, na.rm = TRUE))

cheese_year %>%
  left_join(milk_year, by = "year") %>%
  gather(type, pounds, -1) %>%
  ggplot(aes(x = year, y = pounds, col = type)) +
  geom_line() +
  theme_minimal()
## Warning: Removed 5 rows containing missing values (geom_path).

For better comparison of cheese and milk consumption, I created this line graph.

We can see that the milk consumption experienced huge decrease, whereas the cheese consumption experienced increase.

However, still the demand for cheese is not too great.


Summary

The analysis helped me better understand the consumption of milk products.

By creating the visualization, we could better see that the milk consumption actually experienced huge decrease, while the cheese consumption experienced the opposite. However, we can conclude that the farmers’ decision to turn their extra milk into cheese won’t be a perfect solution of the overproduction of milk.

However, we could find out that if the farmers want to turn their extra milk to cheese, it would be beneficial to turn it into cheddar or mozarella cheese!

Thank you.