1. Introduction

1.1

Milk_sales data set indicates that the trend of annual milk sales has been downtrend since 2010. If we are doing milk business and would like to explore new opportunity in this industry, what’s the potential opportunity there and what risk there could be? Which product is most cosumed in U.S? Does it mean this product could be a sun-rising industry in milk industry?

1.2

I will use milkproductfacts dataset and ggplot package in R to find the market trend in Butter, Cheese, Milk, Yogurt and Ice Cream businesses.

1.3

I will use ggplot to visualize the annual milk sales by types to prove my assumption that annual milk sales has been in downtrend. Then I will wrangle data for milkfacts dataset to create four new variables (cheese, evap_milk, ice_cream, dry_milk) by merging several columns together. At last, ggplot package will be deployed to visualize the trends of multiple products.

1.4

My analysis can help customer get a good insight of milk industry. It could support customer with decision making and strategy design to direct company to right track.

2. Packages Required

Clean and tidy dataset

library(tidyverse)

Data visualization

library(ggplot2)

summary statistics

library(dplyr)

correlation matrices

library(Hmisc)
library(corrplot)

3. Data

fluidcow_facts.csv
milk_products_facts.csv

Data comes from the USDA (United States Department of Agriculture). The raw datasets (Excel Sheets) can be found here.

Importing Data

milkcowfacts <- read.csv("C:/Users/sijia/Desktop/BANA7025/milkcow_facts.csv")
milkproductfacts <- read.csv("C:/Users/sijia/Desktop/BANA7025/milk_products_facts.csv")

Data Review and Manipulation

assessing dimesions

dim(milkcowfacts)
## [1] 35 11
dim(milkproductfacts)
## [1] 43 18
names(milkcowfacts)
##  [1] "year"                          "avg_milk_cow_number"          
##  [3] "milk_per_cow"                  "milk_production_lbs"          
##  [5] "avg_price_milk"                "dairy_ration"                 
##  [7] "milk_feed_price_ratio"         "milk_cow_cost_per_animal"     
##  [9] "milk_volume_to_buy_cow_in_lbs" "alfalfa_hay_price"            
## [11] "slaughter_cow_price"
names(milkproductfacts)
##  [1] "year"                            "fluid_milk"                     
##  [3] "fluid_yogurt"                    "butter"                         
##  [5] "cheese_american"                 "cheese_other"                   
##  [7] "cheese_cottage"                  "evap_cnd_canned_whole_milk"     
##  [9] "evap_cnd_bulk_whole_milk"        "evap_cnd_bulk_and_can_skim_milk"
## [11] "frozen_ice_cream_regular"        "frozen_ice_cream_reduced_fat"   
## [13] "frozen_sherbet"                  "frozen_other"                   
## [15] "dry_whole_milk"                  "dry_nonfat_milk"                
## [17] "dry_buttermilk"                  "dry_whey"

Quick review for each dataset

glimpse(milkcowfacts)
## Observations: 35
## Variables: 11
## $ year                          <dbl> 1980, 1981, 1982, 1983, 1984, 19...
## $ avg_milk_cow_number           <dbl> 10799000, 10898000, 11011000, 11...
## $ milk_per_cow                  <int> 11891, 12183, 12306, 12622, 1254...
## $ milk_production_lbs           <dbl> 1.28406e+11, 1.32770e+11, 1.3550...
## $ avg_price_milk                <dbl> 0.130, 0.138, 0.136, 0.136, 0.13...
## $ dairy_ration                  <dbl> 0.04837357, 0.05035243, 0.044221...
## $ milk_feed_price_ratio         <dbl> 2.717149, 2.759031, 3.088127, 2....
## $ milk_cow_cost_per_animal      <int> 1190, 1200, 1110, 1030, 895, 860...
## $ milk_volume_to_buy_cow_in_lbs <dbl> 9153.846, 8695.652, 8161.765, 75...
## $ alfalfa_hay_price             <dbl> 72.00000, 70.90000, 72.73333, 78...
## $ slaughter_cow_price           <dbl> 0.4573000, 0.4193000, 0.3996000,...
glimpse(milkproductfacts)
## Observations: 43
## Variables: 18
## $ year                            <int> 1975, 1976, 1977, 1978, 1979, ...
## $ fluid_milk                      <int> 247, 247, 244, 241, 238, 234, ...
## $ fluid_yogurt                    <dbl> 1.967839, 2.132685, 2.338369, ...
## $ butter                          <dbl> 4.728193, 4.313202, 4.294180, ...
## $ cheese_american                 <dbl> 8.147222, 8.883106, 9.213005, ...
## $ cheese_other                    <dbl> 6.126409, 6.627872, 6.781846, ...
## $ cheese_cottage                  <dbl> 4.588537, 4.632284, 4.617711, ...
## $ evap_cnd_canned_whole_milk      <dbl> 3.949932, 3.791703, 3.265569, ...
## $ evap_cnd_bulk_whole_milk        <dbl> 1.2385962, 1.1008644, 1.003802...
## $ evap_cnd_bulk_and_can_skim_milk <dbl> 3.525306, 3.590506, 3.879376, ...
## $ frozen_ice_cream_regular        <dbl> 18.20505, 17.63845, 17.28895, ...
## $ frozen_ice_cream_reduced_fat    <dbl> 6.502202, 6.169193, 6.574222, ...
## $ frozen_sherbet                  <dbl> 1.348780, 1.364460, 1.356254, ...
## $ frozen_other                    <dbl> 1.816894, 1.678171, 1.627777, ...
## $ dry_whole_milk                  <dbl> 0.1000000, 0.2000000, 0.200000...
## $ dry_nonfat_milk                 <dbl> 3.261769, 3.504864, 3.308311, ...
## $ dry_buttermilk                  <dbl> 0.2000000, 0.2000000, 0.300000...
## $ dry_whey                        <dbl> 2.200000, 2.400000, 2.400000, ...

Identify Outlier, exclude year and milk_type for more meanful result

boxplot(subset(milkcowfacts, select = -c(year)))

boxplot(subset(milkproductfacts, select = -c(year)))

Check NA values

sum(is.na(milkproductfacts))
## [1] 0
sum(is.na(milkcowfacts))
## [1] 0

Quick look of the datasets

head(milkproductfacts)
##   year fluid_milk fluid_yogurt   butter cheese_american cheese_other
## 1 1975        247     1.967839 4.728193        8.147222     6.126409
## 2 1976        247     2.132685 4.313202        8.883106     6.627872
## 3 1977        244     2.338369 4.294180        9.213005     6.781846
## 4 1978        241     2.448503 4.354593        9.525359     7.309603
## 5 1979        238     2.443847 4.491231        9.597205     7.567657
## 6 1980        234     2.503008 4.467509        9.620140     7.903713
##   cheese_cottage evap_cnd_canned_whole_milk evap_cnd_bulk_whole_milk
## 1       4.588537                   3.949932                1.2385962
## 2       4.632284                   3.791703                1.1008644
## 3       4.617711                   3.265569                1.0038023
## 4       4.600490                   3.148379                0.9007974
## 5       4.434472                   3.120396                0.9374522
## 6       4.408807                   2.885797                0.8767681
##   evap_cnd_bulk_and_can_skim_milk frozen_ice_cream_regular
## 1                        3.525306                 18.20505
## 2                        3.590506                 17.63845
## 3                        3.879376                 17.28895
## 4                        3.469461                 17.22533
## 5                        3.332083                 16.94341
## 6                        3.281817                 17.11750
##   frozen_ice_cream_reduced_fat frozen_sherbet frozen_other dry_whole_milk
## 1                     6.502202       1.348780     1.816894            0.1
## 2                     6.169193       1.364460     1.678171            0.2
## 3                     6.574222       1.356254     1.627777            0.2
## 4                     6.550307       1.294786     1.511782            0.3
## 5                     6.197152       1.202817     1.413432            0.3
## 6                     6.052010       1.190466     1.348990            0.3
##   dry_nonfat_milk dry_buttermilk dry_whey
## 1        3.261769            0.2      2.2
## 2        3.504864            0.2      2.4
## 3        3.308311            0.3      2.4
## 4        3.101835            0.2      2.4
## 5        3.282367            0.2      2.7
## 6        3.011035            0.2      2.7
head(milkcowfacts)
##   year avg_milk_cow_number milk_per_cow milk_production_lbs avg_price_milk
## 1 1980            10799000        11891         1.28406e+11          0.130
## 2 1981            10898000        12183         1.32770e+11          0.138
## 3 1982            11011000        12306         1.35505e+11          0.136
## 4 1983            11059000        12622         1.39588e+11          0.136
## 5 1984            10793000        12541         1.35351e+11          0.135
## 6 1985            10981000        13024         1.43012e+11          0.127
##   dairy_ration milk_feed_price_ratio milk_cow_cost_per_animal
## 1   0.04837357              2.717149                     1190
## 2   0.05035243              2.759031                     1200
## 3   0.04422102              3.088127                     1110
## 4   0.05235034              2.609620                     1030
## 5   0.05338787              2.540000                      895
## 6   0.04500458              2.840000                      860
##   milk_volume_to_buy_cow_in_lbs alfalfa_hay_price slaughter_cow_price
## 1                      9153.846          72.00000              0.4573
## 2                      8695.652          70.90000              0.4193
## 3                      8161.765          72.73333              0.3996
## 4                      7573.529          78.70000              0.3935
## 5                      6629.630          79.48333              0.3981
## 6                      6771.654          73.67500              0.3831

Conclusion: There are good news and bad news about the data sets

Good News: 1. There is no missing value in both data sets 2. There is no outlier in both data sets

Data Clean needs to be done: 1. The data’s decimal place in the two datasets are not consistent, round all data to two decimal places

Data Transformation

round all data to two decimal places for better view

milkproductfacts <- round(milkproductfacts, digits = 2)
milkcowfacts <- round(milkcowfacts, digits = 2)

preview the datasets again

glimpse(milkcowfacts)
## Observations: 35
## Variables: 11
## $ year                          <dbl> 1980, 1981, 1982, 1983, 1984, 19...
## $ avg_milk_cow_number           <dbl> 10799000, 10898000, 11011000, 11...
## $ milk_per_cow                  <dbl> 11891, 12183, 12306, 12622, 1254...
## $ milk_production_lbs           <dbl> 1.28406e+11, 1.32770e+11, 1.3550...
## $ avg_price_milk                <dbl> 0.13, 0.14, 0.14, 0.14, 0.14, 0....
## $ dairy_ration                  <dbl> 0.05, 0.05, 0.04, 0.05, 0.05, 0....
## $ milk_feed_price_ratio         <dbl> 2.72, 2.76, 3.09, 2.61, 2.54, 2....
## $ milk_cow_cost_per_animal      <dbl> 1190, 1200, 1110, 1030, 895, 860...
## $ milk_volume_to_buy_cow_in_lbs <dbl> 9153.85, 8695.65, 8161.76, 7573....
## $ alfalfa_hay_price             <dbl> 72.00, 70.90, 72.73, 78.70, 79.4...
## $ slaughter_cow_price           <dbl> 0.46, 0.42, 0.40, 0.39, 0.40, 0....
glimpse(milkproductfacts)
## Observations: 43
## Variables: 18
## $ year                            <dbl> 1975, 1976, 1977, 1978, 1979, ...
## $ fluid_milk                      <dbl> 247, 247, 244, 241, 238, 234, ...
## $ fluid_yogurt                    <dbl> 1.97, 2.13, 2.34, 2.45, 2.44, ...
## $ butter                          <dbl> 4.73, 4.31, 4.29, 4.35, 4.49, ...
## $ cheese_american                 <dbl> 8.15, 8.88, 9.21, 9.53, 9.60, ...
## $ cheese_other                    <dbl> 6.13, 6.63, 6.78, 7.31, 7.57, ...
## $ cheese_cottage                  <dbl> 4.59, 4.63, 4.62, 4.60, 4.43, ...
## $ evap_cnd_canned_whole_milk      <dbl> 3.95, 3.79, 3.27, 3.15, 3.12, ...
## $ evap_cnd_bulk_whole_milk        <dbl> 1.24, 1.10, 1.00, 0.90, 0.94, ...
## $ evap_cnd_bulk_and_can_skim_milk <dbl> 3.53, 3.59, 3.88, 3.47, 3.33, ...
## $ frozen_ice_cream_regular        <dbl> 18.21, 17.64, 17.29, 17.23, 16...
## $ frozen_ice_cream_reduced_fat    <dbl> 6.50, 6.17, 6.57, 6.55, 6.20, ...
## $ frozen_sherbet                  <dbl> 1.35, 1.36, 1.36, 1.29, 1.20, ...
## $ frozen_other                    <dbl> 1.82, 1.68, 1.63, 1.51, 1.41, ...
## $ dry_whole_milk                  <dbl> 0.10, 0.20, 0.20, 0.30, 0.30, ...
## $ dry_nonfat_milk                 <dbl> 3.26, 3.50, 3.31, 3.10, 3.28, ...
## $ dry_buttermilk                  <dbl> 0.20, 0.20, 0.30, 0.20, 0.20, ...
## $ dry_whey                        <dbl> 2.20, 2.40, 2.40, 2.40, 2.70, ...

milkcow_facts

variable class Description
year date Year
avg_milk_cow_number double Average number of milk cows
milk_per_cow double Average milk production/cow in pounds
milk_production_lbs double Total Milk production in pounds
avg_price_milk double Average price paid for milk (dollars per pound)
dairy_ration double Average price paid for dairy cow rations (dollars per pound)
milk_feed_price_ratio double Ratio of average price of milk per dairy cow ration
milk_cow_cost_per_animal double Average cost of milk cow per animal (dollars)
milk_volume_to_buy_cow_in_lbs double Milk volume required to purchase a cow (pounds)
alfalfa_hay_price double Alfalfa hay price received by farmers (tons)
slaughter_cow_price double Slaughter cow price (value of meat in dollars per pound)

milk_products_facts

variable class description
year double Year
fluid_milk double Average milk consumption in lbs per person
fluid_yogurt double Average yogurt consumption in lbs per person
butter double Average butter consumption in lbs per person
cheese_american double Average American cheese consumption in lbs per person
cheese_other double Average other cheese consumption in lbs per person
cheese_cottage double Average cottage cheese consumption in lbs per person
evap_cnd_canned_whole_milk double Average evaporated and canned whole milk consumption in lbs per person
evap_cnd_bulk_whole_milk double Average evaporated and canned bulk whole milk consumption in lbs per person
evap_cnd_bulk_and_can_skim_milk double Average evaporated and canned bulk and can skim milk consumption in lbs per person
frozen_ice_cream_regular double Average regular frozen ice cream consumption in lbs per person
frozen_ice_cream_reduced_fat double Average reducated fat frozen ice cream consumption in lbs per person
frozen_sherbet double Average frozen sherbet consumption in lbs per person
frozen_other double Average other frozen milk product consumption in lbs per person
dry_whole_milk double Average dry whole milk consumption in lbs per person
dry_nonfat_milk double Average dry nonfat milk consumption in lbs per person
dry_buttermilk double Average dry buttermilk consumption in lbs per person
dry_whey double Average dry whey (milk protein) consumption in lbs per person

4. Proposed Exploratory Data Analysis

4.1 Data Exploratory

I plan to merge following columns from milkproductfacts to create new varaibles:

cheese = cheese_american + cheese_other + cheese_cottage

evap_milk = evap_cnd_canned_whole_milk + evap_cnd_bulk_whole_milk + evap_cnd_bulk_and_can_skim_milk

ice_cream = frozen_ice_cream_regular + frozen_ice_cream_reduced_fat + frozen_sherbet + frozen_other

dry_milk = dry_whole_milk + dry_nonfat_milk + dry_buttermilk + dry_whey

4.2 Plot

I will use ggplot to visualize the popularity of each product from 1975 to 2017.

4.3 Challenge

I am not very familiar with ggplot and will need time to get familiar with it.

4.4 Support Analysis Method

Machine learning is not a must in this project.

5 Exploratory Data Analysis

Consumer Consumption (By product line) from 1975 - 2017

Data Wrangling - Merge columns

years = seq(1975, 2017, 7)
milkproduct <- milkproductfacts %>% 
  filter(year %in% years) %>%
  gather(product, lbs, fluid_milk:dry_whey) %>% 
  mutate(product = fct_collapse(product,
    "dry_milk" = c("dry_whole_milk", "dry_nonfat_milk", "dry_buttermilk"),
    "evap_milk" = c("evap_cnd_canned_whole_milk", "evap_cnd_bulk_whole_milk",
                    "evap_cnd_bulk_and_can_skim_milk"),
    "cheese" = c("cheese_american", "cheese_other", "cheese_cottage"),
    "ice_cream" = c("frozen_ice_cream_regular", "frozen_ice_cream_reduced_fat")
     )) %>% 
  group_by(year, product) %>% 
  summarise(lbs = sum(lbs)) %>% 
  ungroup()

Create new dataset with the columns created above

products <- c("butter", "cheese", "fluid_milk", "fluid_yogurt", "ice_cream")
names <- str_replace(str_to_title(str_remove(products, pattern = "fluid_")), "_", " ")

milk_product_merged <- milkproduct %>% 
  filter(product %in% products) %>% 
  mutate(product = factor(product, labels = names)) 

Data Visualization

milk_product_merged %>% 
  ggplot(aes(x = year, y = lbs, fill = product, group = product)) +
  facet_wrap(~product, nrow = 2) +
  geom_hline(yintercept = 0) + 
  geom_area() +
  labs(title = "Milk consumption Trend in (lbs) between 1975-2017") +
  theme(legend.position = "none",
        axis.text.x = element_blank(),
        axis.text.y = element_blank())

Consumption Comparison Between 1975 and 2017

milk_product_merged %>% 
  filter(year == 1975 | year == 2017) %>% 
  ggplot(aes(x = product, y = lbs, fill = as.factor(year))) +
  geom_hline(yintercept = 0, size = 1)+
  geom_bar(position = "dodge",stat = "identity") +
  labs(title = "Consumer Preference", subtitle = "Avg. consumption per person", xlab="milk_products", ylab="avg.consumption") +
  scale_fill_manual(values = c("#4393C3", "#D6604D"))

correlation matrix

Compute Correlation Matrix

cor_matrix <- cor(subset(milkcowfacts, select = -c(year))) %>%
  round(2) %>%
  corrplot(type = "lower", order = "hclust", 
           tl.col = "black", tl.srt = 45)

6 Summary

6.1 Problem Statement

  1. Milk industry needs some transformation to find new profit points. Which product line we should invest more to earn more profit?

  2. Which product lines need to drease or stop investing in?

  3. What are the factors that affect the milk price most?

6.2 Methodology

For Q1 and Q2, I merged the columns of the milkproductfacts dataset to “Butter”, “Cheese”, “Milk”, “Yogort”, “Ice Cream”. Then I used ggplot to visualize these products’ consumption from 1975 - 2017 to find the polularity of these products.

For Q3, to answer this question, I used function corrplot(), in the package of the same name, creates a graphical display of a correlation matrix, highlighting the most correlated variables in a data table.

Positive correlations are displayed in blue and negative correlations in red color. The color intensity and the size of the circle are proportional to the correlation coefficients. In the right side of the correlogram, the legend color shows the correlation coefficients and the corresponding colors.

6.3 Interesting findings

I notice that the average milk price has a positive correlation with milk per cow and milk production. I am really suspicious about this as it violates market discipline.

6.4

  1. Cheese and yogurt comsuptions are going up throughout the years. These are the sun-rising products which we should invest in.

  2. Milk and icecream product comsumptions have been decreasing all the time. The icecream product line is the one which we should decrease investment size. Although milk product comsumption is also downsizing; it is the heart of the industry. We may consider using more milk to produce more cheese and yogurt instead of sell milk directly to the market.

  3. By reading the correlation matrix, we can find the correlation between avg_price_milk and other factors. The milk feed price ratio and slaughter cow price are the factors affect milk price most. The milk production company should give more attention to these two factors.

6.5 Drawback

One of the concerns I have is that my data is not sufficient to provide detailed insight due to the size limit of the data. This issue affects my correlation conclusion a lot. I am wondering if the conclusion is actually meaningful.

Additionally, I wish I could have more time and energy to build a multi-linear regression model for the milkcowfacts dataset. I came up with this idea just a few minutes ago. It is a pity that I don’t have sufficient time to complete this part. I believe it will be very interesting to learn what factors affected milk price most.