The milksales data set indicates that annual milk sales have been trending downward since 2010. For a company in the milk business that wants to explore new opportunities in the dairy industry, where are the potential opportunities, and what are the risks? Which dairy product is most consumed in the U.S.? Does high consumption mean that the product is a sunrise (high-growth) segment of the dairy industry?
I will use the milkproductfacts data set and the ggplot2 package in R to examine market trends for butter, cheese, milk, yogurt, and ice cream.
First, I will clean and tidy the milksales data set. I will then use ggplot2 to visualize annual milk sales by type and test my assumption that annual milk sales have been declining. Next, I will wrangle the milkproductfacts data set, creating four new variables (cheese, evap_milk, ice_cream, dry_milk) by summing related columns. Finally, I will use ggplot2 to visualize the trends of the different dairy products and identify the growing product that an investor could move into.
This analysis can give a client insight into the dairy industry and support decision making and strategy design.
## Clean and tidy dataset
library(tidyverse)
## Used to quickly get a complete insight of the dataset
library(skimr)
## Data visualization
library(ggplot2)
fluid_milk_sales.csv
milk_products_facts.csv
Data comes from the USDA (United States Department of Agriculture). The raw datasets (Excel Sheets) can be found here.
##Importing Data
milksales <- read.csv("C:/Users/sijia/Desktop/BANA7025/fluid_milk_sales.csv")
milkproductfacts <- read.csv("C:/Users/sijia/Desktop/BANA7025/milk_products_facts.csv")
##Quick review for each dataset
skim(milkproductfacts)
## Skim summary statistics
## n obs: 43
## n variables: 18
##
## -- Variable type:integer --------------------------------------------------------------------------------------------
## variable missing complete n mean sd p0 p25 p50 p75 p100
## fluid_milk 0 43 43 202.91 27.03 149 183 205 223.5 247
## year 0 43 43 1996 12.56 1975 1985.5 1996 2006.5 2017
## hist
## ▃▂▅▆▃▂▇▃
## ▇▇▇▇▇▇▇▇
##
## -- Variable type:numeric --------------------------------------------------------------------------------------------
## variable missing complete n mean sd p0
## butter 0 43 43 4.71 0.43 4.19
## cheese_american 0 43 43 11.95 1.5 8.15
## cheese_cottage 0 43 43 3.13 0.86 2.07
## cheese_other 0 43 43 14.71 4.82 6.13
## dry_buttermilk 0 43 43 0.23 0.054 0.17
## dry_nonfat_milk 0 43 43 3.02 0.53 2.12
## dry_whey 0 43 43 3.05 0.66 1.89
## dry_whole_milk 0 43 43 0.31 0.14 0.095
## evap_cnd_bulk_and_can_skim_milk 0 43 43 4.32 0.82 3.02
## evap_cnd_bulk_whole_milk 0 43 43 0.81 0.29 0.44
## evap_cnd_canned_whole_milk 0 43 43 2.04 0.71 0.94
## fluid_yogurt 0 43 43 7.16 4.34 1.97
## frozen_ice_cream_reduced_fat 0 43 43 6.4 0.43 5.67
## frozen_ice_cream_regular 0 43 43 15.63 1.65 12.47
## frozen_other 0 43 43 3.13 1.37 1.35
## frozen_sherbet 0 43 43 1.14 0.15 0.8
## p25 p50 p75 p100 hist
## 4.37 4.54 4.91 5.69 ▇▇▅▃▂▁▃▂
## 11.28 12.12 12.95 15.06 ▁▂▁▅▆▇▂▁
## 2.56 2.65 4.03 4.63 ▆▇▁▂▁▁▃▃
## 10.68 15.26 18.96 22.05 ▆▂▂▃▆▅▇▅
## 0.2 0.2 0.25 0.39 ▅▇▂▁▁▁▁▁
## 2.62 3.05 3.31 4.28 ▃▆▅▇▇▂▁▂
## 2.4 3.02 3.65 4.09 ▃▇▃▅▃▆▇▅
## 0.2 0.3 0.4 0.6 ▃▇▂▂▇▁▃▁
## 3.64 4.24 5.17 5.58 ▇▃▇▆▂▂▇▇
## 0.58 0.7 1.06 1.46 ▇▇▂▃▃▃▂▂
## 1.49 1.84 2.34 3.95 ▁▇▃▃▁▂▁▁
## 3.78 5.87 11.31 14.93 ▇▇▆▂▂▂▂▆
## 6.08 6.33 6.61 7.55 ▂▇▅▇▂▂▁▁
## 14.69 15.71 17.06 18.21 ▅▁▂▅▇▂▆▆
## 2.27 2.91 3.76 6.54 ▅▂▇▂▂▁▂▁
## 1.11 1.18 1.22 1.36 ▂▂▁▁▃▇▃▂
skim(milksales)
## Skim summary statistics
## n obs: 387
## n variables: 3
##
## -- Variable type:factor ---------------------------------------------------------------------------------------------
## variable missing complete n n_unique
## milk_type 0 387 387 9
## top_counts ordered
## But: 43, Egg: 43, Fla: 43, Fla: 43 FALSE
##
## -- Variable type:numeric --------------------------------------------------------------------------------------------
## variable missing complete n mean sd p0 p25
## pounds 0 387 387 1.2e+10 1.7e+10 7.6e+07 8.4e+08
## year 0 387 387 1996 12.43 1975 1985
## p50 p75 p100 hist
## 3.9e+09 1.7e+10 5.6e+10 ▇▁▂▁▁▁▁▂
## 1996 2007 2017 ▇▇▇▇▇▇▇▇
# No obvious outliers are visible, though fluid_milk dominates the shared axis and compresses the other columns
boxplot(subset(milkproductfacts, select = -c(year)))
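Because all columns share one axis in the boxplot above, a per-column numeric check can complement the visual one. For example, the skim summary reports frozen_other with p25 = 2.27, p75 = 3.76 and a maximum of 6.54, which is above p75 + 1.5 × IQR (about 6.0), so that column may be worth a closer look. The following is a minimal sketch, not part of the original analysis: the `toy` data frame is a stand-in built from the first six butter and fluid_milk values shown in the head() output, and the value 500 is made up purely to show what a flagged point looks like.

```r
# Toy stand-in for milkproductfacts (assumption: illustration only).
# The butter and fluid_milk values come from the first six rows of the data;
# the final fluid_milk value (500) is made up to demonstrate a flagged point.
toy <- data.frame(
  butter     = c(4.728193, 4.313202, 4.294180, 4.354593, 4.491231, 4.467509),
  fluid_milk = c(247, 247, 244, 241, 238, 500)
)

# boxplot.stats() applies the same 1.5 * IQR whisker rule that boxplot() draws,
# so $out holds the points that would be plotted beyond the whiskers.
outliers_by_column <- lapply(toy, function(x) boxplot.stats(x)$out)
outlier_counts <- vapply(outliers_by_column, length, integer(1))

print(outlier_counts)
```

Replacing `toy` with `subset(milkproductfacts, select = -c(year))` would run the same check on every product column.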
# Quick look at the datasets
head(milkproductfacts)
## year fluid_milk fluid_yogurt butter cheese_american cheese_other
## 1 1975 247 1.967839 4.728193 8.147222 6.126409
## 2 1976 247 2.132685 4.313202 8.883106 6.627872
## 3 1977 244 2.338369 4.294180 9.213005 6.781846
## 4 1978 241 2.448503 4.354593 9.525359 7.309603
## 5 1979 238 2.443847 4.491231 9.597205 7.567657
## 6 1980 234 2.503008 4.467509 9.620140 7.903713
## cheese_cottage evap_cnd_canned_whole_milk evap_cnd_bulk_whole_milk
## 1 4.588537 3.949932 1.2385962
## 2 4.632284 3.791703 1.1008644
## 3 4.617711 3.265569 1.0038023
## 4 4.600490 3.148379 0.9007974
## 5 4.434472 3.120396 0.9374522
## 6 4.408807 2.885797 0.8767681
## evap_cnd_bulk_and_can_skim_milk frozen_ice_cream_regular
## 1 3.525306 18.20505
## 2 3.590506 17.63845
## 3 3.879376 17.28895
## 4 3.469461 17.22533
## 5 3.332083 16.94341
## 6 3.281817 17.11750
## frozen_ice_cream_reduced_fat frozen_sherbet frozen_other dry_whole_milk
## 1 6.502202 1.348780 1.816894 0.1
## 2 6.169193 1.364460 1.678171 0.2
## 3 6.574222 1.356254 1.627777 0.2
## 4 6.550307 1.294786 1.511782 0.3
## 5 6.197152 1.202817 1.413432 0.3
## 6 6.052010 1.190466 1.348990 0.3
## dry_nonfat_milk dry_buttermilk dry_whey
## 1 3.261769 0.2 2.2
## 2 3.504864 0.2 2.4
## 3 3.308311 0.3 2.4
## 4 3.101835 0.2 2.4
## 5 3.282367 0.2 2.7
## 6 3.011035 0.2 2.7
head(milksales)
## year milk_type pounds
## 1 1975 Whole 3.6188e+10
## 2 1976 Whole 3.5241e+10
## 3 1977 Whole 3.4036e+10
## 4 1978 Whole 3.3235e+10
## 5 1979 Whole 3.2480e+10
## 6 1980 Whole 3.1253e+10
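One way to check the direction of sales before plotting is to compute the year-over-year change within each milk type. The sketch below is illustrative only: `milksales_head` is a hypothetical name for a data frame rebuilt from the six rows printed above, and the derived columns `billion_lbs` and `pct_change` are not in the original data.

```r
library(dplyr)

# First six rows of milksales, copied from the head() output above.
milksales_head <- data.frame(
  year      = 1975:1980,
  milk_type = "Whole",
  pounds    = c(3.6188e10, 3.5241e10, 3.4036e10, 3.3235e10, 3.2480e10, 3.1253e10)
)

# Year-over-year percentage change, computed separately for each milk type.
sales_change <- milksales_head %>%
  group_by(milk_type) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(
    billion_lbs = pounds / 1e9,                      # easier to read than raw pounds
    pct_change  = (pounds / lag(pounds) - 1) * 100   # NA for the first year of each type
  ) %>%
  ungroup()

print(sales_change)
```

Applied to the full milksales data, the same pipeline would give one series of changes per milk type, which could then be filtered to the years after 2010.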
Data dictionary for fluid_milk_sales.csv (loaded as milksales; classes as reported by skim()):

| variable | class | description |
|---|---|---|
| year | numeric | Year |
| milk_type | factor | Category of milk product |
| pounds | double | Pounds of milk product per year |

Data dictionary for milk_products_facts.csv (loaded as milkproductfacts; classes as reported by skim()):

| variable | class | description |
|---|---|---|
| year | integer | Year |
| fluid_milk | integer | Average milk consumption in lbs per person |
| fluid_yogurt | double | Average yogurt consumption in lbs per person |
| butter | double | Average butter consumption in lbs per person |
| cheese_american | double | Average American cheese consumption in lbs per person |
| cheese_other | double | Average other cheese consumption in lbs per person |
| cheese_cottage | double | Average cottage cheese consumption in lbs per person |
| evap_cnd_canned_whole_milk | double | Average evaporated and condensed canned whole milk consumption in lbs per person |
| evap_cnd_bulk_whole_milk | double | Average evaporated and condensed bulk whole milk consumption in lbs per person |
| evap_cnd_bulk_and_can_skim_milk | double | Average evaporated and condensed bulk and canned skim milk consumption in lbs per person |
| frozen_ice_cream_regular | double | Average regular ice cream consumption in lbs per person |
| frozen_ice_cream_reduced_fat | double | Average reduced-fat ice cream consumption in lbs per person |
| frozen_sherbet | double | Average frozen sherbet consumption in lbs per person |
| frozen_other | double | Average other frozen milk product consumption in lbs per person |
| dry_whole_milk | double | Average dry whole milk consumption in lbs per person |
| dry_nonfat_milk | double | Average dry nonfat milk consumption in lbs per person |
| dry_buttermilk | double | Average dry buttermilk consumption in lbs per person |
| dry_whey | double | Average dry whey (milk protein) consumption in lbs per person |
I plan to combine the following columns from milkproductfacts to create new variables:
cheese = cheese_american + cheese_other + cheese_cottage
evap_milk = evap_cnd_canned_whole_milk + evap_cnd_bulk_whole_milk + evap_cnd_bulk_and_can_skim_milk
ice_cream = frozen_ice_cream_regular + frozen_ice_cream_reduced_fat + frozen_sherbet + frozen_other
dry_milk = dry_whole_milk + dry_nonfat_milk + dry_buttermilk + dry_whey
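The four sums above could be written with dplyr's `mutate()`. This is a minimal sketch under stated assumptions: `facts_head` is a hypothetical data frame holding only the 1975 and 1976 rows copied from the head() output, and `facts_merged` is an illustrative name for the result.

```r
library(dplyr)

# 1975 and 1976 rows of milkproductfacts, copied from the head() output.
facts_head <- data.frame(
  year                            = c(1975, 1976),
  cheese_american                 = c(8.147222, 8.883106),
  cheese_other                    = c(6.126409, 6.627872),
  cheese_cottage                  = c(4.588537, 4.632284),
  evap_cnd_canned_whole_milk      = c(3.949932, 3.791703),
  evap_cnd_bulk_whole_milk        = c(1.2385962, 1.1008644),
  evap_cnd_bulk_and_can_skim_milk = c(3.525306, 3.590506),
  frozen_ice_cream_regular        = c(18.20505, 17.63845),
  frozen_ice_cream_reduced_fat    = c(6.502202, 6.169193),
  frozen_sherbet                  = c(1.348780, 1.364460),
  frozen_other                    = c(1.816894, 1.678171),
  dry_whole_milk                  = c(0.1, 0.2),
  dry_nonfat_milk                 = c(3.261769, 3.504864),
  dry_buttermilk                  = c(0.2, 0.2),
  dry_whey                        = c(2.2, 2.4)
)

# Sum the related columns into the four planned variables.
facts_merged <- facts_head %>%
  mutate(
    cheese    = cheese_american + cheese_other + cheese_cottage,
    evap_milk = evap_cnd_canned_whole_milk + evap_cnd_bulk_whole_milk +
      evap_cnd_bulk_and_can_skim_milk,
    ice_cream = frozen_ice_cream_regular + frozen_ice_cream_reduced_fat +
      frozen_sherbet + frozen_other,
    dry_milk  = dry_whole_milk + dry_nonfat_milk + dry_buttermilk + dry_whey
  ) %>%
  select(year, cheese, evap_milk, ice_cream, dry_milk)

print(facts_merged)
```

Because skim() reports no missing values in these columns, plain `+` is enough here; with missing data, `rowSums(..., na.rm = TRUE)` would be one alternative.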
I will use ggplot to visualize the popularity of each product from 1975 to 2017.
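A common pattern for this kind of chart is to reshape the data to long format and draw one line per product. The sketch below is only a starting point: `products_head` is a hypothetical data frame with the 1975 to 1980 yogurt and butter values from the head() output, and the axis labels are my own wording.

```r
library(tidyr)
library(ggplot2)

# Yogurt and butter consumption for 1975-1980, copied from the head() output.
products_head <- data.frame(
  year         = 1975:1980,
  fluid_yogurt = c(1.967839, 2.132685, 2.338369, 2.448503, 2.443847, 2.503008),
  butter       = c(4.728193, 4.313202, 4.294180, 4.354593, 4.491231, 4.467509)
)

# Long format: one row per year and product, which is what ggplot2 expects
# when mapping a variable to colour.
products_long <- pivot_longer(
  products_head,
  cols      = -year,
  names_to  = "product",
  values_to = "lbs_per_person"
)

# One line per product across years.
trend_plot <- ggplot(products_long,
                     aes(x = year, y = lbs_per_person, colour = product)) +
  geom_line() +
  labs(x = "Year", y = "Consumption (lbs per person)", colour = "Product")

print(trend_plot)
```

The same reshaping would apply to the full data once the combined cheese, evap_milk, ice_cream, and dry_milk columns exist.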
I am not yet very familiar with ggplot2 and will need time to learn it.
Machine learning is not required for this project.