This is a R Markdown document for milk product analysis.
Please follow the each tab in order.
Though Americans comsume a good amount of milk products, there is a stockpile of cheese that is massive enough to wrap around the U.S. Capitol.
This happened because farmers have overproduced milk thanks to its high price and have failed to anticipate that American consume less milk than before.
So the farmers turned their extra milks into cheese because it is less perishable and stay fresh for longer period.
I decided to investigate the milk and cheese consumption data to get insights about America’s milk product consumption by providing data visualization and by doing so, to allow users to understand what is the best way to handle the glut of milk.
library(readr) # used to import csv files
library(dplyr) # used to manipulate data
library(tidyr) # used to tidy data
library(ggplot2) # used to creat data visualization
I acquired the data from Tidy Tuesday Dariy Data.
However, they are providing the tidied data that are available from USDA Dairy Data.
Also, information about how the data is assembled can be found from here.
clean_cheese <- read_csv("data/clean_cheese.csv")
fluid_milk_sales <- read_csv("data/fluid_milk_sales.csv")
pop <- read_csv("data/pop.csv")
Let’s start by importing the data. The first data is about the cheese consumption per capita and the second data is about the beverage milk consumption.
I am going to compare the milk consumption per capita with the beverage milk consumption per capita. To have a fair comparison, I needed the same scale. I looked for how the data of cheese consumption per capita is measured and used same calculation to get beverage milk consumption per capita.
For this reason, the third data about the resident population plus the Armed Forces overseas is collected from here.
ERS provides annual per capita consumption estimates for major dairy products. For most products, per capita consumption is calculated by dividing domestic disappearance by the U.S. resident population plus armed forces overseas.
sapply(mget(ls()), dim)
## clean_cheese fluid_milk_sales pop
## [1,] 48 387 43
## [2,] 17 3 2
sapply(mget(ls()), summary)
## $clean_cheese
## Year Cheddar American Other Mozzarella
## Min. :1970 Min. : 5.790 Min. :1.200 Min. : 1.190
## 1st Qu.:1982 1st Qu.: 8.297 1st Qu.:2.220 1st Qu.: 3.183
## Median :1994 Median : 9.515 Median :2.585 Median : 7.695
## Mean :1994 Mean : 8.889 Mean :2.618 Mean : 6.861
## 3rd Qu.:2005 3rd Qu.: 9.920 3rd Qu.:2.965 3rd Qu.: 9.967
## Max. :2017 Max. :11.070 Max. :3.990 Max. :11.730
##
## Italian other Swiss Brick Muenster
## Min. :0.870 Min. :0.880 Min. :0.01000 Min. :0.1700
## 1st Qu.:1.522 1st Qu.:1.067 1st Qu.:0.03000 1st Qu.:0.2775
## Median :2.185 Median :1.170 Median :0.05000 Median :0.3350
## Mean :2.153 Mean :1.155 Mean :0.05396 Mean :0.3402
## 3rd Qu.:2.752 3rd Qu.:1.240 3rd Qu.:0.07250 3rd Qu.:0.4025
## Max. :3.490 Max. :1.350 Max. :0.12000 Max. :0.5300
##
## Cream and Neufchatel Blue Other Dairy Cheese Processed Cheese
## Min. :0.610 Min. :0.1500 Min. :0.410 Min. :3.310
## 1st Qu.:1.100 1st Qu.:0.1600 1st Qu.:0.750 1st Qu.:3.817
## Median :2.050 Median :0.1700 Median :0.965 Median :4.415
## Mean :1.744 Mean :0.1981 Mean :1.018 Mean :4.329
## 3rd Qu.:2.350 3rd Qu.:0.1825 3rd Qu.:1.295 3rd Qu.:4.805
## Max. :2.640 Max. :0.3200 Max. :1.590 Max. :5.440
## NA's :12
## Foods and spreads Total American Chese Total Italian Cheese
## Min. :1.910 Min. : 7.00 Min. : 2.050
## 1st Qu.:2.987 1st Qu.:10.81 1st Qu.: 4.685
## Median :3.220 Median :11.83 Median : 9.945
## Mean :3.164 Mean :11.51 Mean : 9.012
## 3rd Qu.:3.440 3rd Qu.:12.84 3rd Qu.:12.727
## Max. :3.980 Max. :15.06 Max. :15.210
##
## Total Natural Cheese Total Processed Cheese Products
## Min. :11.37 Min. :5.530
## 1st Qu.:19.47 1st Qu.:6.897
## Median :26.29 Median :7.595
## Mean :25.35 Mean :7.492
## 3rd Qu.:31.78 3rd Qu.:8.195
## Max. :37.23 Max. :8.750
##
##
## $fluid_milk_sales
## year milk_type pounds
## Min. :1975 Length:387 Min. :7.600e+07
## 1st Qu.:1985 Class :character 1st Qu.:8.365e+08
## Median :1996 Mode :character Median :3.920e+09
## Mean :1996 Mean :1.196e+10
## 3rd Qu.:2007 3rd Qu.:1.748e+10
## Max. :2017 Max. :5.553e+10
##
## $pop
## year population
## Min. :1975 Min. :216.0
## 1st Qu.:1986 1st Qu.:239.6
## Median :1996 Median :269.7
## Mean :1996 Mean :270.3
## 3rd Qu.:2006 3rd Qu.:300.3
## Max. :2017 Max. :328.0
sapply(mget(ls()), head)
## $clean_cheese
## # A tibble: 6 x 17
## Year Cheddar `American Other` Mozzarella `Italian other` Swiss Brick
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1970 5.79 1.2 1.19 0.87 0.88 0.1
## 2 1971 5.91 1.42 1.38 0.92 0.94 0.11
## 3 1972 6.01 1.67 1.57 1.02 1.06 0.1
## 4 1973 6.07 1.76 1.76 1.03 1.06 0.11
## 5 1974 6.31 2.16 1.86 1.09 1.18 0.11
## 6 1975 6.04 2.11 2.11 1.12 1.09 0.09
## # ... with 10 more variables: Muenster <dbl>, `Cream and
## # Neufchatel` <dbl>, Blue <dbl>, `Other Dairy Cheese` <dbl>, `Processed
## # Cheese` <dbl>, `Foods and spreads` <dbl>, `Total American
## # Chese` <dbl>, `Total Italian Cheese` <dbl>, `Total Natural
## # Cheese` <dbl>, `Total Processed Cheese Products` <dbl>
##
## $fluid_milk_sales
## # A tibble: 6 x 3
## year milk_type pounds
## <dbl> <chr> <dbl>
## 1 1975 Whole 36188000000
## 2 1976 Whole 35241000000
## 3 1977 Whole 34036000000
## 4 1978 Whole 33235000000
## 5 1979 Whole 32480000000
## 6 1980 Whole 31253000000
##
## $pop
## # A tibble: 6 x 2
## year population
## <dbl> <dbl>
## 1 1975 216.
## 2 1976 218.
## 3 1977 220.
## 4 1978 223.
## 5 1979 225.
## 6 1980 228.
sapply(mget(ls()), tail)
## $clean_cheese
## # A tibble: 6 x 17
## Year Cheddar `American Other` Mozzarella `Italian other` Swiss Brick
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2012 9.59 3.66 10.7 3.11 1.09 0.04
## 2 2013 9.64 3.71 10.7 3.1 1 0.03
## 3 2014 9.85 3.8 11.2 3.02 1.02 0.01
## 4 2015 10.2 3.86 11.3 3.2 1.05 0.01
## 5 2016 10.4 3.95 11.7 3.49 1.06 0.01
## 6 2017 11.1 3.99 11.6 3.49 1.05 0.01
## # ... with 10 more variables: Muenster <dbl>, `Cream and
## # Neufchatel` <dbl>, Blue <dbl>, `Other Dairy Cheese` <dbl>, `Processed
## # Cheese` <dbl>, `Foods and spreads` <dbl>, `Total American
## # Chese` <dbl>, `Total Italian Cheese` <dbl>, `Total Natural
## # Cheese` <dbl>, `Total Processed Cheese Products` <dbl>
##
## $fluid_milk_sales
## # A tibble: 6 x 3
## year milk_type pounds
## <dbl> <chr> <dbl>
## 1 2012 Total Production 53442153871.
## 2 2013 Total Production 52285761122.
## 3 2014 Total Production 50773151199.
## 4 2015 Total Production 50037811066.
## 5 2016 Total Production 49699917113.
## 6 2017 Total Production 48628692047.
##
## $pop
## # A tibble: 6 x 2
## year population
## <dbl> <dbl>
## 1 2012 314.
## 2 2013 317.
## 3 2014 319.
## 4 2015 323.
## 5 2016 325.
## 6 2017 328.
One important point I could noticed was that Blue cheese values from year 1998 to 2009 are missing.
I searched for the reason why the values are missing and found this foot note from the original data:
NA= not available; USDA, National Agricultural Statistics Service did not report blue cheese production from 1998 through 2009 to avoid disclosing data from individual operations.
So I decided to leave the missing values for now because I felt that they explain what they exactly mean.
clean_cheese <- clean_cheese[, c(1:10)]
names(clean_cheese) <- c("year", "cheddar", "american_other", "mozzarella",
"italian_other", "swiss", "brick", "muenster",
"cream_and_neufchatel", "blue")
clean_cheese <-
clean_cheese %>%
gather("cheese_type", "pounds", -1)
american <- c("cheddar", "american_other")
italian <- c("mozzarella", "italian_other")
clean_cheese$category <-
case_when(
clean_cheese$cheese_type %in% american ~ "american",
clean_cheese$cheese_type %in% italian ~ "italian",
TRUE ~ "other"
)
clean_cheese <- clean_cheese[,c(1, 4, 2, 3)]
clean_cheese$category <- as.factor(clean_cheese$category)
clean_cheese$cheese_type <- as.factor(clean_cheese$cheese_type)
To clean the cheese data to perform my analysis, I needed to leave only the cheese from cows. I removed unnecessary columns with the first code.
Second code is written just to have the column name format that I prefer.
Third code is written to tidy the data.
Then, I wanted to have another column that categorizes the type of cheese. other type means other types of cheeses than american and italian typically from cows.
I reordered the column and made the category and cheese_type column into factor data type.
fluid_milk_sales <- fluid_milk_sales %>%
spread(milk_type, pounds)
fluid_milk_sales <- fluid_milk_sales[, c(1:8, 10)]
names(fluid_milk_sales) <- c("year", "buttermilk", "eggnog",
"flavored_not_whole", "flavored_whole",
"low_fat_1%", "reduced_fat_2%", "skim",
"whole")
fluid_milk_sales <- fluid_milk_sales %>%
gather("milk_type", "pounds", -1)
pop$population <- pop$population * 1000000
fluid_milk_sales <-
fluid_milk_sales %>%
left_join(pop, by = "year") %>%
mutate(
pounds = pounds / population
) %>%
select(
year,
milk_type,
pounds
)
fluid_milk_sales$milk_type <- as.factor(fluid_milk_sales$milk_type)
I did pretty much the same thing like I did for cheese data to beverage milk data.
One thing to notice from here is that, as I mentioned before, I made the same scale for the two main dataset for fair comparison.
The population data I got was in a million to have actual value, I multiplied 1,000,000 to that vector.
Then I coded to have a pounds value that represents beverage milk consumption per capital in pounds.
clean_cheese %>%
ggplot(aes(x = year, y = pounds, col = cheese_type)) +
geom_line() +
theme_minimal()
This line graph shows the consumption of each cheese type by year.
The most noticable point is the dramatic increase of consumption of mozzarella cheese.
clean_cheese %>%
group_by(year, category) %>%
summarise(pounds = sum(pounds, na.rm = TRUE)) %>%
ggplot(aes(x = year, y = pounds, col = category)) +
geom_line() +
theme_minimal()
This second line graph shows the consumption of each cheese category by year.
The consumption for all of three categories are increasing.
fluid_milk_sales %>%
ggplot(aes(x = year, y = pounds, col = milk_type)) +
geom_line() +
theme_minimal()
This line graph shows the consumption of each beverage milk products by year.
There is a dramatic decrease in consumption of whole milk.
cheese_year <-
clean_cheese %>%
group_by(year) %>%
summarise(cheese = sum(pounds, na.rm = TRUE))
milk_year <-
fluid_milk_sales %>%
group_by(year) %>%
summarise(beverage_milk = sum(pounds, na.rm = TRUE))
cheese_year %>%
left_join(milk_year, by = "year") %>%
gather(type, pounds, -1) %>%
ggplot(aes(x = year, y = pounds, col = type)) +
geom_line() +
theme_minimal()
## Warning: Removed 5 rows containing missing values (geom_path).
For better comparison of cheese and milk consumption, I created this line graph.
We can see that the milk consumption experienced huge decrease, whereas the cheese consumption experienced increase.
However, still the demand for cheese is not too great.
The analysis helped me better understand the consumption of milk products.
By creating the visualization, we could better see that the milk consumption actually experienced huge decrease, while the cheese consumption experienced the opposite. However, we can conclude that the farmers’ decision to turn their extra milk into cheese won’t be a perfect solution of the overproduction of milk.
However, we could find out that if the farmers want to turn their extra milk to cheese, it would be beneficial to turn it into cheddar or mozarella cheese!
Thank you.