Milk_sales data set indicates that the trend of annual milk sales has been downtrend since 2010. If we are doing milk business and would like to explore new opportunity in this industry, what’s the potential opportunity there and what risk there could be? Which product is most cosumed in U.S? Does it mean this product could be a sun-rising industry in milk industry?
I will use milkproductfacts dataset and ggplot package in R to find the market trend in Butter, Cheese, Milk, Yogurt and Ice Cream businesses.
I will use ggplot to visualize the annual milk sales by types to prove my assumption that annual milk sales has been in downtrend. Then I will wrangle data for milkfacts dataset to create four new variables (cheese, evap_milk, ice_cream, dry_milk) by merging several columns together. At last, ggplot package will be deployed to visualize the trends of multiple products.
My analysis can help customer get a good insight of milk industry. It could support customer with decision making and strategy design to direct company to right track.
library(tidyverse)
library(ggplot2)
library(dplyr)
library(Hmisc)
library(corrplot)
fluidcow_facts.csv
milk_products_facts.csv
Data comes from the USDA (United States Department of Agriculture). The raw datasets (Excel Sheets) can be found here.
milkcowfacts <- read.csv("C:/Users/sijia/Desktop/BANA7025/milkcow_facts.csv")
milkproductfacts <- read.csv("C:/Users/sijia/Desktop/BANA7025/milk_products_facts.csv")
dim(milkcowfacts)
## [1] 35 11
dim(milkproductfacts)
## [1] 43 18
names(milkcowfacts)
## [1] "year" "avg_milk_cow_number"
## [3] "milk_per_cow" "milk_production_lbs"
## [5] "avg_price_milk" "dairy_ration"
## [7] "milk_feed_price_ratio" "milk_cow_cost_per_animal"
## [9] "milk_volume_to_buy_cow_in_lbs" "alfalfa_hay_price"
## [11] "slaughter_cow_price"
names(milkproductfacts)
## [1] "year" "fluid_milk"
## [3] "fluid_yogurt" "butter"
## [5] "cheese_american" "cheese_other"
## [7] "cheese_cottage" "evap_cnd_canned_whole_milk"
## [9] "evap_cnd_bulk_whole_milk" "evap_cnd_bulk_and_can_skim_milk"
## [11] "frozen_ice_cream_regular" "frozen_ice_cream_reduced_fat"
## [13] "frozen_sherbet" "frozen_other"
## [15] "dry_whole_milk" "dry_nonfat_milk"
## [17] "dry_buttermilk" "dry_whey"
glimpse(milkcowfacts)
## Observations: 35
## Variables: 11
## $ year <dbl> 1980, 1981, 1982, 1983, 1984, 19...
## $ avg_milk_cow_number <dbl> 10799000, 10898000, 11011000, 11...
## $ milk_per_cow <int> 11891, 12183, 12306, 12622, 1254...
## $ milk_production_lbs <dbl> 1.28406e+11, 1.32770e+11, 1.3550...
## $ avg_price_milk <dbl> 0.130, 0.138, 0.136, 0.136, 0.13...
## $ dairy_ration <dbl> 0.04837357, 0.05035243, 0.044221...
## $ milk_feed_price_ratio <dbl> 2.717149, 2.759031, 3.088127, 2....
## $ milk_cow_cost_per_animal <int> 1190, 1200, 1110, 1030, 895, 860...
## $ milk_volume_to_buy_cow_in_lbs <dbl> 9153.846, 8695.652, 8161.765, 75...
## $ alfalfa_hay_price <dbl> 72.00000, 70.90000, 72.73333, 78...
## $ slaughter_cow_price <dbl> 0.4573000, 0.4193000, 0.3996000,...
glimpse(milkproductfacts)
## Observations: 43
## Variables: 18
## $ year <int> 1975, 1976, 1977, 1978, 1979, ...
## $ fluid_milk <int> 247, 247, 244, 241, 238, 234, ...
## $ fluid_yogurt <dbl> 1.967839, 2.132685, 2.338369, ...
## $ butter <dbl> 4.728193, 4.313202, 4.294180, ...
## $ cheese_american <dbl> 8.147222, 8.883106, 9.213005, ...
## $ cheese_other <dbl> 6.126409, 6.627872, 6.781846, ...
## $ cheese_cottage <dbl> 4.588537, 4.632284, 4.617711, ...
## $ evap_cnd_canned_whole_milk <dbl> 3.949932, 3.791703, 3.265569, ...
## $ evap_cnd_bulk_whole_milk <dbl> 1.2385962, 1.1008644, 1.003802...
## $ evap_cnd_bulk_and_can_skim_milk <dbl> 3.525306, 3.590506, 3.879376, ...
## $ frozen_ice_cream_regular <dbl> 18.20505, 17.63845, 17.28895, ...
## $ frozen_ice_cream_reduced_fat <dbl> 6.502202, 6.169193, 6.574222, ...
## $ frozen_sherbet <dbl> 1.348780, 1.364460, 1.356254, ...
## $ frozen_other <dbl> 1.816894, 1.678171, 1.627777, ...
## $ dry_whole_milk <dbl> 0.1000000, 0.2000000, 0.200000...
## $ dry_nonfat_milk <dbl> 3.261769, 3.504864, 3.308311, ...
## $ dry_buttermilk <dbl> 0.2000000, 0.2000000, 0.300000...
## $ dry_whey <dbl> 2.200000, 2.400000, 2.400000, ...
boxplot(subset(milkcowfacts, select = -c(year)))
boxplot(subset(milkproductfacts, select = -c(year)))
sum(is.na(milkproductfacts))
## [1] 0
sum(is.na(milkcowfacts))
## [1] 0
head(milkproductfacts)
## year fluid_milk fluid_yogurt butter cheese_american cheese_other
## 1 1975 247 1.967839 4.728193 8.147222 6.126409
## 2 1976 247 2.132685 4.313202 8.883106 6.627872
## 3 1977 244 2.338369 4.294180 9.213005 6.781846
## 4 1978 241 2.448503 4.354593 9.525359 7.309603
## 5 1979 238 2.443847 4.491231 9.597205 7.567657
## 6 1980 234 2.503008 4.467509 9.620140 7.903713
## cheese_cottage evap_cnd_canned_whole_milk evap_cnd_bulk_whole_milk
## 1 4.588537 3.949932 1.2385962
## 2 4.632284 3.791703 1.1008644
## 3 4.617711 3.265569 1.0038023
## 4 4.600490 3.148379 0.9007974
## 5 4.434472 3.120396 0.9374522
## 6 4.408807 2.885797 0.8767681
## evap_cnd_bulk_and_can_skim_milk frozen_ice_cream_regular
## 1 3.525306 18.20505
## 2 3.590506 17.63845
## 3 3.879376 17.28895
## 4 3.469461 17.22533
## 5 3.332083 16.94341
## 6 3.281817 17.11750
## frozen_ice_cream_reduced_fat frozen_sherbet frozen_other dry_whole_milk
## 1 6.502202 1.348780 1.816894 0.1
## 2 6.169193 1.364460 1.678171 0.2
## 3 6.574222 1.356254 1.627777 0.2
## 4 6.550307 1.294786 1.511782 0.3
## 5 6.197152 1.202817 1.413432 0.3
## 6 6.052010 1.190466 1.348990 0.3
## dry_nonfat_milk dry_buttermilk dry_whey
## 1 3.261769 0.2 2.2
## 2 3.504864 0.2 2.4
## 3 3.308311 0.3 2.4
## 4 3.101835 0.2 2.4
## 5 3.282367 0.2 2.7
## 6 3.011035 0.2 2.7
head(milkcowfacts)
## year avg_milk_cow_number milk_per_cow milk_production_lbs avg_price_milk
## 1 1980 10799000 11891 1.28406e+11 0.130
## 2 1981 10898000 12183 1.32770e+11 0.138
## 3 1982 11011000 12306 1.35505e+11 0.136
## 4 1983 11059000 12622 1.39588e+11 0.136
## 5 1984 10793000 12541 1.35351e+11 0.135
## 6 1985 10981000 13024 1.43012e+11 0.127
## dairy_ration milk_feed_price_ratio milk_cow_cost_per_animal
## 1 0.04837357 2.717149 1190
## 2 0.05035243 2.759031 1200
## 3 0.04422102 3.088127 1110
## 4 0.05235034 2.609620 1030
## 5 0.05338787 2.540000 895
## 6 0.04500458 2.840000 860
## milk_volume_to_buy_cow_in_lbs alfalfa_hay_price slaughter_cow_price
## 1 9153.846 72.00000 0.4573
## 2 8695.652 70.90000 0.4193
## 3 8161.765 72.73333 0.3996
## 4 7573.529 78.70000 0.3935
## 5 6629.630 79.48333 0.3981
## 6 6771.654 73.67500 0.3831
Conclusion: There are good news and bad news about the data sets
Good News: 1. There is no missing value in both data sets 2. There is no outlier in both data sets
Data Clean needs to be done: 1. The data’s decimal place in the two datasets are not consistent, round all data to two decimal places
milkproductfacts <- round(milkproductfacts, digits = 2)
milkcowfacts <- round(milkcowfacts, digits = 2)
glimpse(milkcowfacts)
## Observations: 35
## Variables: 11
## $ year <dbl> 1980, 1981, 1982, 1983, 1984, 19...
## $ avg_milk_cow_number <dbl> 10799000, 10898000, 11011000, 11...
## $ milk_per_cow <dbl> 11891, 12183, 12306, 12622, 1254...
## $ milk_production_lbs <dbl> 1.28406e+11, 1.32770e+11, 1.3550...
## $ avg_price_milk <dbl> 0.13, 0.14, 0.14, 0.14, 0.14, 0....
## $ dairy_ration <dbl> 0.05, 0.05, 0.04, 0.05, 0.05, 0....
## $ milk_feed_price_ratio <dbl> 2.72, 2.76, 3.09, 2.61, 2.54, 2....
## $ milk_cow_cost_per_animal <dbl> 1190, 1200, 1110, 1030, 895, 860...
## $ milk_volume_to_buy_cow_in_lbs <dbl> 9153.85, 8695.65, 8161.76, 7573....
## $ alfalfa_hay_price <dbl> 72.00, 70.90, 72.73, 78.70, 79.4...
## $ slaughter_cow_price <dbl> 0.46, 0.42, 0.40, 0.39, 0.40, 0....
glimpse(milkproductfacts)
## Observations: 43
## Variables: 18
## $ year <dbl> 1975, 1976, 1977, 1978, 1979, ...
## $ fluid_milk <dbl> 247, 247, 244, 241, 238, 234, ...
## $ fluid_yogurt <dbl> 1.97, 2.13, 2.34, 2.45, 2.44, ...
## $ butter <dbl> 4.73, 4.31, 4.29, 4.35, 4.49, ...
## $ cheese_american <dbl> 8.15, 8.88, 9.21, 9.53, 9.60, ...
## $ cheese_other <dbl> 6.13, 6.63, 6.78, 7.31, 7.57, ...
## $ cheese_cottage <dbl> 4.59, 4.63, 4.62, 4.60, 4.43, ...
## $ evap_cnd_canned_whole_milk <dbl> 3.95, 3.79, 3.27, 3.15, 3.12, ...
## $ evap_cnd_bulk_whole_milk <dbl> 1.24, 1.10, 1.00, 0.90, 0.94, ...
## $ evap_cnd_bulk_and_can_skim_milk <dbl> 3.53, 3.59, 3.88, 3.47, 3.33, ...
## $ frozen_ice_cream_regular <dbl> 18.21, 17.64, 17.29, 17.23, 16...
## $ frozen_ice_cream_reduced_fat <dbl> 6.50, 6.17, 6.57, 6.55, 6.20, ...
## $ frozen_sherbet <dbl> 1.35, 1.36, 1.36, 1.29, 1.20, ...
## $ frozen_other <dbl> 1.82, 1.68, 1.63, 1.51, 1.41, ...
## $ dry_whole_milk <dbl> 0.10, 0.20, 0.20, 0.30, 0.30, ...
## $ dry_nonfat_milk <dbl> 3.26, 3.50, 3.31, 3.10, 3.28, ...
## $ dry_buttermilk <dbl> 0.20, 0.20, 0.30, 0.20, 0.20, ...
## $ dry_whey <dbl> 2.20, 2.40, 2.40, 2.40, 2.70, ...
| variable | class | Description |
|---|---|---|
| year | date | Year |
| avg_milk_cow_number | double | Average number of milk cows |
| milk_per_cow | double | Average milk production/cow in pounds |
| milk_production_lbs | double | Total Milk production in pounds |
| avg_price_milk | double | Average price paid for milk (dollars per pound) |
| dairy_ration | double | Average price paid for dairy cow rations (dollars per pound) |
| milk_feed_price_ratio | double | Ratio of average price of milk per dairy cow ration |
| milk_cow_cost_per_animal | double | Average cost of milk cow per animal (dollars) |
| milk_volume_to_buy_cow_in_lbs | double | Milk volume required to purchase a cow (pounds) |
| alfalfa_hay_price | double | Alfalfa hay price received by farmers (tons) |
| slaughter_cow_price | double | Slaughter cow price (value of meat in dollars per pound) |
| variable | class | description |
|---|---|---|
| year | double | Year |
| fluid_milk | double | Average milk consumption in lbs per person |
| fluid_yogurt | double | Average yogurt consumption in lbs per person |
| butter | double | Average butter consumption in lbs per person |
| cheese_american | double | Average American cheese consumption in lbs per person |
| cheese_other | double | Average other cheese consumption in lbs per person |
| cheese_cottage | double | Average cottage cheese consumption in lbs per person |
| evap_cnd_canned_whole_milk | double | Average evaporated and canned whole milk consumption in lbs per person |
| evap_cnd_bulk_whole_milk | double | Average evaporated and canned bulk whole milk consumption in lbs per person |
| evap_cnd_bulk_and_can_skim_milk | double | Average evaporated and canned bulk and can skim milk consumption in lbs per person |
| frozen_ice_cream_regular | double | Average regular frozen ice cream consumption in lbs per person |
| frozen_ice_cream_reduced_fat | double | Average reducated fat frozen ice cream consumption in lbs per person |
| frozen_sherbet | double | Average frozen sherbet consumption in lbs per person |
| frozen_other | double | Average other frozen milk product consumption in lbs per person |
| dry_whole_milk | double | Average dry whole milk consumption in lbs per person |
| dry_nonfat_milk | double | Average dry nonfat milk consumption in lbs per person |
| dry_buttermilk | double | Average dry buttermilk consumption in lbs per person |
| dry_whey | double | Average dry whey (milk protein) consumption in lbs per person |
I plan to merge following columns from milkproductfacts to create new varaibles:
cheese = cheese_american + cheese_other + cheese_cottage
evap_milk = evap_cnd_canned_whole_milk + evap_cnd_bulk_whole_milk + evap_cnd_bulk_and_can_skim_milk
ice_cream = frozen_ice_cream_regular + frozen_ice_cream_reduced_fat + frozen_sherbet + frozen_other
dry_milk = dry_whole_milk + dry_nonfat_milk + dry_buttermilk + dry_whey
I will use ggplot to visualize the popularity of each product from 1975 to 2017.
I am not very familiar with ggplot and will need time to get familiar with it.
Machine learning is not a must in this project.
years = seq(1975, 2017, 7)
milkproduct <- milkproductfacts %>%
filter(year %in% years) %>%
gather(product, lbs, fluid_milk:dry_whey) %>%
mutate(product = fct_collapse(product,
"dry_milk" = c("dry_whole_milk", "dry_nonfat_milk", "dry_buttermilk"),
"evap_milk" = c("evap_cnd_canned_whole_milk", "evap_cnd_bulk_whole_milk",
"evap_cnd_bulk_and_can_skim_milk"),
"cheese" = c("cheese_american", "cheese_other", "cheese_cottage"),
"ice_cream" = c("frozen_ice_cream_regular", "frozen_ice_cream_reduced_fat")
)) %>%
group_by(year, product) %>%
summarise(lbs = sum(lbs)) %>%
ungroup()
products <- c("butter", "cheese", "fluid_milk", "fluid_yogurt", "ice_cream")
names <- str_replace(str_to_title(str_remove(products, pattern = "fluid_")), "_", " ")
milk_product_merged <- milkproduct %>%
filter(product %in% products) %>%
mutate(product = factor(product, labels = names))
milk_product_merged %>%
ggplot(aes(x = year, y = lbs, fill = product, group = product)) +
facet_wrap(~product, nrow = 2) +
geom_hline(yintercept = 0) +
geom_area() +
labs(title = "Milk consumption Trend in (lbs) between 1975-2017") +
theme(legend.position = "none",
axis.text.x = element_blank(),
axis.text.y = element_blank())
milk_product_merged %>%
filter(year == 1975 | year == 2017) %>%
ggplot(aes(x = product, y = lbs, fill = as.factor(year))) +
geom_hline(yintercept = 0, size = 1)+
geom_bar(position = "dodge",stat = "identity") +
labs(title = "Consumer Preference", subtitle = "Avg. consumption per person", xlab="milk_products", ylab="avg.consumption") +
scale_fill_manual(values = c("#4393C3", "#D6604D"))
cor_matrix <- cor(subset(milkcowfacts, select = -c(year))) %>%
round(2) %>%
corrplot(type = "lower", order = "hclust",
tl.col = "black", tl.srt = 45)
Milk industry needs some transformation to find new profit points. Which product line we should invest more to earn more profit?
Which product lines need to drease or stop investing in?
What are the factors that affect the milk price most?
For Q1 and Q2, I merged the columns of the milkproductfacts dataset to “Butter”, “Cheese”, “Milk”, “Yogort”, “Ice Cream”. Then I used ggplot to visualize these products’ consumption from 1975 - 2017 to find the polularity of these products.
For Q3, to answer this question, I used function corrplot(), in the package of the same name, creates a graphical display of a correlation matrix, highlighting the most correlated variables in a data table.
Positive correlations are displayed in blue and negative correlations in red color. The color intensity and the size of the circle are proportional to the correlation coefficients. In the right side of the correlogram, the legend color shows the correlation coefficients and the corresponding colors.
I notice that the average milk price has a positive correlation with milk per cow and milk production. I am really suspicious about this as it violates market discipline.
Cheese and yogurt comsuptions are going up throughout the years. These are the sun-rising products which we should invest in.
Milk and icecream product comsumptions have been decreasing all the time. The icecream product line is the one which we should decrease investment size. Although milk product comsumption is also downsizing; it is the heart of the industry. We may consider using more milk to produce more cheese and yogurt instead of sell milk directly to the market.
By reading the correlation matrix, we can find the correlation between avg_price_milk and other factors. The milk feed price ratio and slaughter cow price are the factors affect milk price most. The milk production company should give more attention to these two factors.
One of the concerns I have is that my data is not sufficient to provide detailed insight due to the size limit of the data. This issue affects my correlation conclusion a lot. I am wondering if the conclusion is actually meaningful.
Additionally, I wish I could have more time and energy to build a multi-linear regression model for the milkcowfacts dataset. I came up with this idea just a few minutes ago. It is a pity that I don’t have sufficient time to complete this part. I believe it will be very interesting to learn what factors affected milk price most.