Photo by Russell 2020


Reading time: 29 minutes

1 SUMMARY

This is a personal side project using an avocado public data set introduced in the 6-month extensive Google Data Analytics Professional Certificate program. The goal of this project is to identify which city in the US has the cheapest avocado and is the price able to fit within a budget of $800 a year with a consumption rate of minimum 1 avocado per day. Another goal is to predict the trend of avocado prices, organic and non-organic, in the future 24 months and hopefully their prices in the chosen city remain budget-friendly.

Numerous graphs were created during EDA and prediction, include box plots, tree maps, bar charts, pie charts, different types of time plot (general, differences, sub-series), plots for ACF-Log, histograms for normality and forecasting graphs. Forecasting touches on seasonality and stationary tests using pp-test, wo-test, and autocorrrelation test using Ljung-Box test. There are 4 forecasting activities on predicting the price of non-organic and organic avocado prices in the US, as well as in the city of chosen in the next 24 months.

Results reveal that 97% of avocados in the US were sold as non-organic avocado, and 3% as organic. Most popular avocados are small and medium size (PLU4225). Picking Houston as the city of first preference to move to as the city is the second cheapest non-organic avocado city, the cheapest organic avocado city, and the top 10 most avocado-consuming cities. From my forecast, 95% of non-organic avocado prices in Houston will fall below my threshold of 2 dollars per avocado, but for organic avocado, 95% of chance that the price may exceed 2 dollars after 2019. Base on this data set and results, I will move to Houston, consume organic avocado until the end of 2019, then switch to non-organic ones. These conditions fulfilled my desire of at least 1 avocado per day and not exceeding threshold of $800 per year, and in the same time being environmental friendlier.

2 R PACKAGES

R packages loaded in this projects include tidyverse packages (ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats), skimr, lubridate, kableExtra, ggrepel, hms, treemapify, fpp2, and seastests.

library(tidyverse)
library(skimr)
library(lubridate)
library(kableExtra)
library(ggrepel)
library(hms)
library(treemapify)
library(fpp2)
library(seastests)

3 SCENARIO AND BUSINESS QUESTIONS

In this scenario - (1) I am an avocado lover, (2) I must consume minimum of 1 avocado per day, and (3) I am migrating to US next year to experience US life.

I would like to move to a US city with cheap avocado. Despite I would like to eat as more as possible but I still have a budget of $800 per year allocated for avocado consumption.

In my plan, I would like find out the top-5 cities where avocados are the cheapest and hopefully the cheap price remains similar in the next 2 years after I move there. It will be great if I can contribute something for the environment by eating organic avocado. Of course, if the organic price is within my budget.

I hope this project can give me some insightful answers!

4 Data PREPARATION

Data set used in this personal project has an official name known as “Avocado Prices”. The data set was introduced by the 6-month extensive Google Professional Data Analytics Certificate that I completed on 31 May 2021. This data set was used by the program in a few parts of its data analysis activities.

Instructed by the Google program, the data set was downloaded from Kaggle. Kaggle is a great website for data science community, click this link to the page of the dataset. The data set was created and uploaded by Justin Kiggins onto Kaggle and the data was originally from the Hass Avocado Board website in May 2018, click here to the Hass Avocado Board website.

4.1 Data import

This section imports the “Avocado Price” data set into R with a given data set name called “avocado”. Following table shows the first 10,000 rows of raw data within the data set.

avocado <- read_csv("avocado.csv")
avocado

4.2 Data description

The data set contains important information such as dates, avocado average prices, amounts sold, 3 PLU codes of avocados, type of the produce and associated region. Check out following table for these details. The information in the description column are from Kaggle.

Variable <- c("X1",
              "Date",
              "AveragePrice",
              "Total Volume",
              "4046",
              "4225",
              "4770",
              "Total Bags",
              "Small Bags",
              "Large Bags",
              "XLarge Bags",
              "type",
              "year",
              "region")

Description <- c("Index",
                 "The date of the observation",
                 "The average price of a single avocado, even when multiple avocados are sold in bags",
                 "Total number of avocados sold",
                 "Total number of avocados sold by 4046, it is a Product Lookup (PLU) code",
                 "Total number of avocados sold by 4225, it is a Product Lookup (PLU) code",
                 "Total number of avocados sold by 4770, it is a Product Lookup (PLU) code",
                 "Missing information (Kar: The unit is actually per avocado instead of per bag- see section 5)",
                 "Missing information (Kar:  The unit is actually per avocado, see section 5)",
                 "Missing information (Kar: It should be the total number of avocado sold, not number of bags, see section 5)",
                 "Missing information (Kar: It should be the total number of avocado sold, not number of bags, see section 5)",
                 "conventional or organic",
                 "the year",
                 "the city or region of the observation")

data.frame(Variable, Description) %>% 
  kbl(align = "l", 
      table.attr = "style = 'width: 60;'") %>% 
  kable_styling(bootstrap_options = c("hover",
                                      "bordered",
                                      "striped"))
Variable Description
X1 Index
Date The date of the observation
AveragePrice The average price of a single avocado, even when multiple avocados are sold in bags
Total Volume Total number of avocados sold
4046 Total number of avocados sold by 4046, it is a Product Lookup (PLU) code
4225 Total number of avocados sold by 4225, it is a Product Lookup (PLU) code
4770 Total number of avocados sold by 4770, it is a Product Lookup (PLU) code
Total Bags Missing information (Kar: The unit is actually per avocado instead of per bag- see section 5)
Small Bags Missing information (Kar: The unit is actually per avocado, see section 5)
Large Bags Missing information (Kar: It should be the total number of avocado sold, not number of bags, see section 5)
XLarge Bags Missing information (Kar: It should be the total number of avocado sold, not number of bags, see section 5)
type conventional or organic
year the year
region the city or region of the observation

According to the Hass Avocado Board website, 3 of the above mentioned Product Lookup (PLU) codes are:

PLU <- c("4046", "4225", "4770")
Category <- c("Small/Medium Hass Avocado (~3-5oz avocado)", 
              "Large Hass Avocado (~8-10oz avocado)",
              "Extra Large Hass Avocado (~10-15oz avocado)")

data.frame(PLU, Category) %>% 
  kbl(align = "l",
      table.attr = "style = 'width: 40;'") %>% 
  kable_classic(c("hover", "condense"))
PLU Category
4046 Small/Medium Hass Avocado (~3-5oz avocado)
4225 Large Hass Avocado (~8-10oz avocado)
4770 Extra Large Hass Avocado (~10-15oz avocado)

4.3 Data exploration

This data set has 18,249 rows of data (or known as observation) and 14 variables. It has 1 character column, 1 date and 11 numeric columns.

  • Following data summary table group these variables into different chunks with examples from the data set.
  • This table n_missing and complete_rate information to tell completeness within eac variable.

skim_without_charts(avocado)
Data summary
Name avocado
Number of rows 18249
Number of columns 14
_______________________
Column type frequency:
character 2
Date 1
numeric 11
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
type 0 1 7 12 0 2 0
region 0 1 4 19 0 54 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
Date 0 1 2015-01-04 2018-03-25 2016-08-14 169

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
X1 0 1 24.23 15.48 0.00 10.00 24.00 38.00 52.00
AveragePrice 0 1 1.41 0.40 0.44 1.10 1.37 1.66 3.25
Total Volume 0 1 850644.01 3453545.36 84.56 10838.58 107376.76 432962.29 62505646.52
4046 0 1 293008.42 1264989.08 0.00 854.07 8645.30 111020.20 22743616.17
4225 0 1 295154.57 1204120.40 0.00 3008.78 29061.02 150206.86 20470572.61
4770 0 1 22839.74 107464.07 0.00 0.00 184.99 6243.42 2546439.11
Total Bags 0 1 239639.20 986242.40 0.00 5088.64 39743.83 110783.37 19373134.37
Small Bags 0 1 182194.69 746178.51 0.00 2849.42 26362.82 83337.67 13384586.80
Large Bags 0 1 54338.09 243965.96 0.00 127.47 2647.71 22029.25 5719096.61
XLarge Bags 0 1 3106.43 17692.89 0.00 0.00 0.00 132.50 551693.65
year 0 1 2016.15 0.94 2015.00 2015.00 2016.00 2017.00 2018.00

The data set is quite complete without any missing values. The columns n_missing and complete_rate indicate that all variables have 0 missing value and achieve 1 (100%) complete rate.

There are no white spaces to trim as well, by examining the column whitespace.

We can clearly see that in the middle section, Variable type: Date, shows that this data set was recorded from 2015-01-04 to 2018-03-25.

5 DATA INTEGRITY ASSESSMENT

5.1 “Unit” problems in the data

Some important data problems have been identified.

  • Variables “Total Volume”, “4046”, “4225”, “4770”, “Total Bags”, “Small Bags”, “Large Bags”, and “XLarge Bags” should be integers because they are recording how many avocados were sold, and thus the values should not have floating numbers. Justin Kiggins (the dataset creator on Kaggle) has also confirmed with us that he was pretty sure these numbers are integers. I will follow his information and remove all the floating numbers (decimal numbers) during data cleaning.

Showing the first 3 rows of these columns:

avocado[1:3, 4:11]

Clich this Link to the relevant discussion page:

  • However, there is still missing information to explain the unit of the Total Bags, Small BagsLarge Bags, and XLarge Bags. I was thinking that the unit must be “bag”, however when I looked at the column “Total Volume”, and found that it is actually the combination of the columns “4046”, “4225”, “4770”, “Small Bags”, “Large Bags”, and “XLarge Bags”.

Try the calculation on the above first row

4046 + 4225 + 4770 + Small Bags + Large Bags + XLarge Bag (Notice: Total Bags is not included.)
= 1036.74 + 54454.85 + 48.16 + 8603.62 + 93.25 + 0
=

1036.74 + 54454.85 + 48.16  + 8603.62   + 93.25 + 0
## [1] 64236.62

And it matches the value in the first row of Total Volume: 64236.62.

It should not be the case because bag columns should be in the unit of bags, and they should be converted into actual number of avocado sold before taking part into that addition. Quite strangely, I am not able to find the information regarding how many avocado in each size category of bag and thus I am not able to do the conversion.

So, I would believe that the values in the columns of Total, Small, Large, and XLarge Bags should be the actual amount of avocado sold already, instead of in the unit of bag. The data supplier may have helped us done the the conversion.

These problems can be easily solved by,

  1. Taking the numbers in the columns “Total Volume”, “4046”, “4225”, “4770”, “Total Bags”, “Small Bags”, “Large Bags”, and “XLarge Bags” as integers.
  2. Taking the values in “Total Bags”, “Small Bags”, “Large Bags”, and “XLarge Bags” as total number of avocados sold.

And removing:

  1. All the floating numbers in all the aforementioned columns.
  2. the column “Total Volume”. It is an easy metric from the sum of 4046, 4225, 4770, and 3 sizes of bags, I will create a new one later after data cleaning.
  3. the column “Total Bags”. It is an easy metric from the sum of the 3 sizes of bags. I will create a new one later after data cleaning, if required.

These activities will be performed in the data cleaning section.

5.2 Is the data ROCCC?

I will examine the context of the data set on following aspects:

  1. Reliability - The data set come from reliable sources, It was from the Kaggle, a well known sites for professional data science community, as well as from the well-designed Google data analytic course that I completed. This data set should be free from unfairness elements, malicious manipulation, and biases from unprofessional practices during data collection and preliminary processing.
  2. Original - I did not collect this data set myself but obtained from the Kaggle website, and the up-loader obtained from another website. I say it is a third-hand data, and which explains some ambiguity and confusion in the data set mentioned in the previous section.
  3. Comprehensive - This data is not as comprehensive as I thought but should be enough for the core purpose of this personal project.
  4. Cited - This data set has been cited on Kaggle website that links to the original paper, and identifying who are the authors.
  5. Current - This data is not current as it is from 2015-01-04 to 2018-03-25.

In short, this data is moderately reliable, not original, less comprehensive, cited and not current. Therefore, these elements would affect the reliability of the results from this analysis.

6 DATA CLEANING AND MANIPULATION

6.1 Convert column names to lower and snake case

This is a standard naming convention in data industry to boost code readability and to avoid potential system errors due to spaces in the strings (character data) when using a programming language for analysis. For example, converting “Total Volume” into “total_volume” is often the standard approach.

Click the right button.

# The dataset is now called avocado2 and being worked on, I am preserving the original data set.
avocado2 <- avocado 

# to lower case
names(avocado2) <- tolower(names(avocado2))

# to snake case
avocado2 <- avocado2 %>% 
  rename("average_price" = "averageprice",
         "total_volume" = "total volume",
         "total_bags" = "total bags",
         "amount_from_small_bags" = "small bags",
         "amount_from_large_bags" = "large bags",
         "amount_from_xlarge_bags" = "xlarge bags")

# check out the column names

I use these codes to convert “averageprice”, “total volume”, “total bags”, “small bags”, “large bags”, and “xlarge bags” into lower and snake case “average_price”, “total_volume”, “total_bags”, “amount_from_small_bags”, “amount_from_large_bags”, and “amount_from_xlarge_bags”.

Checking these conversion using the first 3 rows:

avocado2[1:3,]

6.2 Renaming the “4046”, ”4225“, and”4770"

This is also to reduce potential errors during data manipulation. They may be recognised as numbers during analysis instead of being recognised as column names.

Following codes complete the conversion. Click the right code button.

avocado2 <- avocado2 %>% 
  rename("PLU4046" = "4046",
         "PLU4225" = "4225",
         "PLU4770" = "4770")

6.3 Removing unwanted columns

Removing “x1”, “total_volume”, and “total_bags”.

  • “x1” is redundant.
  • “total_volume” - It is just the sum of “4046”, “4225”, “4770”, “total_bags”, “small_bags”, “large_bags”, and “xlarge_bags”.
  • “total_bags” - It is just the sum of “small_bags”, “large_bags”, and “xlarge_bags”.

Recall, I am removing “total_volume”, and “total_bags” because their result is faulty as values within these associated columns are faulty. They should be integer but floating numbers were introduced - see section 5.1.

Using following codes (click right):

avocado2 <- avocado2 %>% 
  select(-x1, -total_volume, -total_bags)

Now I have them removed. The number of column has now been reduced from 14 columns to 11 columns.

avocado2[1:3,]

6.4 Removing decimal places

This section removes erroneous decimal places in “PLU4046”, “PLU4225”, “PLU4770”, “amount_from_small_bags”, “amount_from_large_bags”, and “amount_from_xlarge_bags” by rounding up.

avocado2 <- avocado2 %>% 
  mutate(PLU4046 = round(PLU4046),
         PLU4225 = round(PLU4225),
         PLU4770 = round(PLU4770),
         amount_from_small_bags = round(amount_from_small_bags),
         amount_from_large_bags = round(amount_from_large_bags),
         amount_from_xlarge_bags = round(amount_from_xlarge_bags))

6.5 Data type conversion

Converting the variables “Type”, “Year”, and “region” from their data type (denoted as “chr”, “dbl”, “chr”) into factor type (will be denoted as “fctr”), because they have categorical feature, by which they can be used to group the data during analysis, and be ordered into certain sequence of preference.

More importantly, their levels (elements within them) can be extracted using code and be inspected for errors such as misspelling.

Codes to complete the conversion, click right button:

avocado2 <- avocado2 %>% 
  mutate_if(is.character, factor) %>% 
  mutate(year = factor(year))

Checking is there errors within these variables, click the right button:

avocado2 %>% 
  select(type, year, region) %>% 
  sapply(levels)
## $type
## [1] "conventional" "organic"     
## 
## $year
## [1] "2015" "2016" "2017" "2018"
## 
## $region
##  [1] "Albany"              "Atlanta"             "BaltimoreWashington"
##  [4] "Boise"               "Boston"              "BuffaloRochester"   
##  [7] "California"          "Charlotte"           "Chicago"            
## [10] "CincinnatiDayton"    "Columbus"            "DallasFtWorth"      
## [13] "Denver"              "Detroit"             "GrandRapids"        
## [16] "GreatLakes"          "HarrisburgScranton"  "HartfordSpringfield"
## [19] "Houston"             "Indianapolis"        "Jacksonville"       
## [22] "LasVegas"            "LosAngeles"          "Louisville"         
## [25] "MiamiFtLauderdale"   "Midsouth"            "Nashville"          
## [28] "NewOrleansMobile"    "NewYork"             "Northeast"          
## [31] "NorthernNewEngland"  "Orlando"             "Philadelphia"       
## [34] "PhoenixTucson"       "Pittsburgh"          "Plains"             
## [37] "Portland"            "RaleighGreensboro"   "RichmondNorfolk"    
## [40] "Roanoke"             "Sacramento"          "SanDiego"           
## [43] "SanFrancisco"        "Seattle"             "SouthCarolina"      
## [46] "SouthCentral"        "Southeast"           "Spokane"            
## [49] "StLouis"             "Syracuse"            "Tampa"              
## [52] "TotalUS"             "West"                "WestTexNewMexico"

During inspection, I found there are no misspelling, spaces, or typo, and I am good to proceed.

6.6 Managing the messy “region” data

The “region” column from the data set is actually a faulty mix of cities, states and generic location such as “West” and “Northeast”. It is quite a mess, but it is why I am here for. I will create a new column located before the “region” column and having it to group cities, stats and generic location, or something else if necessary.

avocado2 <- avocado2 %>% 
  mutate(region_grouping = fct_collapse(region,
    "state" = c("California", "SouthCarolina"),
    "city" = c("Albany", "Atlanta", "BaltimoreWashington", "Boise", "Boston", "BuffaloRochester", "Charlotte", "Chicago", "CincinnatiDayton", "Columbus", "DallasFtWorth", "Denver", "Detroit", "GrandRapids", "GreatLakes", "HarrisburgScranton", "HartfordSpringfield", "Houston", "Indianapolis", "Jacksonville", "LasVegas", "LosAngeles", "Louisville", "MiamiFtLauderdale", "Nashville", "NewOrleansMobile", "NewYork", "Orlando", "Philadelphia", "PhoenixTucson", "Pittsburgh", "Plains", "Portland", "RaleighGreensboro", "RichmondNorfolk", "Roanoke", "Sacramento", "SanDiego", "SanFrancisco", "Seattle", "Spokane", "StLouis", "Syracuse", "Tampa", "WestTexNewMexico"),
    "regional" = c("Midsouth", "Northeast", "NorthernNewEngland", "SouthCentral", "Southeast", "West"),
    "Total" = c("TotalUS")
  )) %>% 
  relocate(region_grouping, .before = region)

Please note: “BaltimoreWashington” is actually Baltimore in Maryland not Washington. It might be a typo.

avocado2 %>% 
  select(region_grouping, region) %>% 
  group_by(region_grouping) %>% 
  summarise("region" = paste(unique(region), collapse = ", ")) %>% 
  kbl(caption = "Doing my best to group US cities, states, and regional category based on information searched from Google Map (Finger-crossing)") %>% 
  kable_styling(bootstrap_options = c("border", "hover"))
Doing my best to group US cities, states, and regional category based on information searched from Google Map (Finger-crossing)
region_grouping region
city Albany, Atlanta, BaltimoreWashington, Boise, Boston, BuffaloRochester, Charlotte, Chicago, CincinnatiDayton, Columbus, DallasFtWorth, Denver, Detroit, GrandRapids, GreatLakes, HarrisburgScranton, HartfordSpringfield, Houston, Indianapolis, Jacksonville, LasVegas, LosAngeles, Louisville, MiamiFtLauderdale, Nashville, NewOrleansMobile, NewYork, Orlando, Philadelphia, PhoenixTucson, Pittsburgh, Plains, Portland, RaleighGreensboro, RichmondNorfolk, Roanoke, Sacramento, SanDiego, SanFrancisco, Seattle, Spokane, StLouis, Syracuse, Tampa, WestTexNewMexico
state California, SouthCarolina
regional Midsouth, Northeast, NorthernNewEngland, SouthCentral, Southeast, West
Total TotalUS

6.7 Synthesise metric

Calculating the total of consumption by adding PLU4046, PLU4225, PLU4770, amount_from_small_bags, amount_from_large_bags, amount_from_large_bags.

avocado2 <- avocado2 %>% 
  mutate(total_consumption = PLU4046 + PLU4225 + PLU4770 + amount_from_small_bags + amount_from_large_bags + amount_from_xlarge_bags)

7 EXPLORATORY DATA ANALYSIS

7.1 Move to Houston?

Following graph shows the average prices of avocado in each city farmed by conventional method. Click the right button to see codes.

avocado2 %>% 
  filter(region_grouping == "city", type == "conventional") %>%           
  select(date, region, average_price) %>% 
  ggplot(aes(x = reorder(region, -average_price, na.rm = T), y = average_price)) +
  geom_jitter(aes(colour = region, alpha = 0.5)) +
  geom_violin(outlier.shape = NA, alpha = 0.5, size = 1) +
  geom_hline(yintercept = 1.5, linetype = 2) +
  geom_hline(yintercept = 1, linetype = 2) +
  annotate("rect", xmin = "LosAngeles", xmax = "PhoenixTucson", ymin = -Inf, ymax = Inf, alpha = 0.2) +
  geom_text(x = "WestTexNewMexico", y = 2.5, label = "My top 5 cities!", hjust = 0.5) +
  stat_summary(fun = "mean") +
  labs(x = "US City",
       y = "Avocado prices", 
       caption = "From 2015-01-04 to 2018-03-25 (1176 days)",
       title = "Figure 1. Violin plot of Non-organic Avocado Prices",
       subtitle = "Visual aids: (1) Black dots are average price of individual avocado by city between Jan-2015 to Mar-2018, (2) the plot has been ordered descendingly, (3) Body of violin become fatter when data points increase.") +
  theme_classic() + 
  theme(legend.position = "none", 
        axis.text.x = element_text(angle = 25, vjust = 0.65),
        plot.title = element_text(face = "bold", size = 15)) +
  scale_y_continuous(lim = c(0, 3), breaks = seq(0, 3, 0.5))

My top 5 cities of preference for non-organic avocado are:

avocado2 %>% 
  filter(region_grouping == "city", type == "conventional") %>%           
  select(date, region, average_price) %>% 
  group_by(region) %>% 
  summarise(Average_price_per_avocado = paste0("$ ", round(mean(average_price), 2)),
            Standard.deviation = paste0("$ ", round(sd(average_price), 2)),
            max = paste0("$ ", round(max(average_price), 2)),
            min = paste0("$ ", round(min(average_price), 2)),
            median = paste0("$ ", round(median(average_price), 2))) %>% 
  arrange(Average_price_per_avocado) %>% 
  slice(1:5) %>% 
  mutate(No. = row_number()) %>% 
  relocate(No., .before = region) %>% 
  rename(city = region) %>% 
  kbl(align = "c",
      table.attr = "style = 'widhth = 50;'") %>% 
  kable_classic_2("hover")
No.  city Average_price_per_avocado Standard.deviation max min median
1 PhoenixTucson $ 0.73 $ 0.18 $ 1.31 $ 0.46 $ 0.68
2 Houston $ 0.83 $ 0.15 $ 1.29 $ 0.51 $ 0.79
3 WestTexNewMexico $ 0.84 $ 0.13 $ 1.22 $ 0.52 $ 0.82
4 DallasFtWorth $ 0.85 $ 0.14 $ 1.3 $ 0.65 $ 0.82
5 LosAngeles $ 0.98 $ 0.24 $ 1.8 $ 0.53 $ 0.94

Looking good, there are Houston and Los Angeles. Good cities, near coastal, may be good for career opportunities as well.

Following graph shows the average prices of avocado in each city farmed by organic method. Click the right button to see codes.

avocado2 %>% 
  filter(region_grouping == "city", type == "organic") %>%           
  select(date, region, average_price) %>% 
  ggplot(aes(x = reorder(region, -average_price, na.rm = T), y = average_price)) +
  geom_jitter(aes(colour = region, alpha = 0.5)) +
  geom_violin(outlier.shape = NA, alpha = 0.5, size = 1) +
  geom_hline(yintercept = 1.5, linetype = 2) +
  geom_hline(yintercept = 1, linetype = 2) +
  annotate("rect", xmin = "CincinnatiDayton", xmax = "Houston", ymin = -Inf, ymax = Inf, alpha = 0.2) +
  geom_text(x = "Denver", y = 2.7, label = "My top 5 cities!", hjust = 0.5) +
  stat_summary(fun = "mean") +
  labs(x = "US City",
       y = "Avocado prices",
       caption = "From 2015-01-04 to 2018-03-25 (1176 days)",
       title = "Figure 2. Violin plot of Organic Avocado Prices",
       subtitle = "Visual aids: (1) Black dots are average price of individual avocado by city between Jan-2015 to Mar-2018, (2) the plot has been ordered descendingly, (3) Body of violin become fatter when data points increase.") +
  theme_classic() +
  theme(legend.position = "none", 
        axis.text.x = element_text(angle = 25, vjust = 0.65),
        plot.title = element_text(face = "bold", size = 15)) +
  scale_y_continuous(lim = c(0, 3), breaks = seq(0, 3, 0.5)) 

My top 5 cities of preference for organic avocado:

avocado2 %>% 
  filter(region_grouping == "city", type == "organic") %>%           
  select(date, region, average_price) %>% 
  group_by(region) %>% 
  summarise(Average_price_per_avocado = paste0("$ ", round(mean(average_price), 2)),
            Standard.deviation = paste0("$ ", round(sd(average_price), 2)),
            max = paste0("$ ", round(max(average_price), 2)),
            min = paste0("$ ", round(min(average_price), 2)),
            median = paste0("$ ", round(median(average_price), 2))) %>% 
  arrange(Average_price_per_avocado) %>% 
  slice(1:5) %>% 
  mutate(No. = row_number()) %>% 
  relocate(No., .before = region) %>% 
  rename(city = region) %>% 
  kbl(align = "c",
      table.attr = "style = 'width: 20;'") %>% 
  kable_classic_2("hover")
No.  city Average_price_per_avocado Standard.deviation max min median
1 Houston $ 1.27 $ 0.25 $ 1.92 $ 0.81 $ 1.26
2 DallasFtWorth $ 1.32 $ 0.2 $ 1.9 $ 0.86 $ 1.35
3 Denver $ 1.36 $ 0.35 $ 2.16 $ 0.66 $ 1.39
4 CincinnatiDayton $ 1.4 $ 0.34 $ 2.2 $ 0.44 $ 1.42
5 Roanoke $ 1.4 $ 0.3 $ 2.27 $ 0.7 $ 1.44

Looking good, Houston again, ranked 1 now and Dallas ranked number 2.

This section shows Houston might be the best city I should move to. Even if I consume organic avocado at the maximum price of $1.92 whole year round with a minimum of 1 avocado per day, the maximum amount of spending will be:

1.92*365
## [1] 700.8

It is still less than my budget of $800 / year.

However, I will carry out prediction later to ensure the price remains the similar for the next 24 months, hopefully.

7.2 Where are my avocado fellows?

Ii might be good to learn some avocado knowledge before going to the US, especially to find out which cities consume the most.

Following graph shows:

  • The top 5 cities consume non-organic Avocado the most are GreatLakes, Los Angeles, Plains, New York, and Dallas.
  • The top 5 cities consume organic Avocado the most are GreatLakes, Los Angeles, New York, Plains, and Seattle.
# set up the df

df7.2 <- avocado2 %>% 
  select(type, region_grouping, region, total_consumption) %>% 
  filter(region_grouping == "city") %>% 
  group_by(type, region) %>% 
  summarise(consumption = sum(total_consumption)) %>% 
  group_by(type) %>% 
  mutate(per = round(consumption/sum(consumption), 3)) %>% 
  mutate(type = fct_recode(type,
                           "Non-organic Avocado" = "conventional",
                           "Organic Avocado" = "organic"))
# plot the graph

ggplot(df7.2, aes(area = consumption)) +
  geom_treemap(aes(fill = type)) +
  geom_treemap_text(aes(label = paste(region, "\n", prettyNum(consumption, big.mark = ","), "\n", per*100, "%")), place = "center",
                    colour = "white") +
  labs(title = "Figure 3. Total Avocado Consumption in 45 US cities using Tree Map",
       caption = "Total consumption From 2015-01-04 to 2018-03-25 (1176 days)") +
  theme(legend.position = "none",
        strip.text = element_text(size = 15),
        strip.background = element_rect(fill = "white"),
        plot.title = element_text(size = 20, face = "bold", hjust= 0.5, vjust = 2)) +
  facet_wrap(~type)

Houston has a 4.5% consumption of avocado, though it is 3 times smaller than great Lakes but it has nearly 200 millions of consumption for nearly 3 years, it is tremendous. I am strongly confident that I would find my avocado fellows in the city.

7.3 How is the organic avocado consumption?

7.3.1 Consumption of organic avocado in the US

df7.3.1 <-  avocado2 %>% 
  filter(region_grouping == "city") %>% 
  group_by(type) %>% 
  summarise(overall_consumption = sum(total_consumption)) %>% 
  mutate(per = round(overall_consumption/sum(overall_consumption)*100, 2),
         per = paste0(per, "%")) %>% 
  mutate(type = fct_recode(type,
                           "Organic" = "organic",
                           "Non-Organic" = "conventional"))

ggplot(df7.3.1, aes(x = "", y = overall_consumption, fill = type)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste0(type, "\n", prettyNum(overall_consumption, big.mark = ","), "\n", per)), 
                position = position_stack(vjust = 0.5)) +
  coord_polar(theta = "y",
              start = 0) + 
  theme_minimal() +
  theme(legend.position = "none",
        axis.title = element_blank(),
        axis.text = element_blank(),
        plot.title = element_text(size = 11, hjust = 0.5, face = "bold")) +
  labs(title = "Figure 4. 97% of Avocado were Cosumed as Inorganic and 3% were Consumed as Organic",
       caption = "From 2015-01-04 to 2018-03-25 (1176 days)") +
  scale_fill_brewer(palette = "Set3")

7.3.2 Consumption of organic avocado in the top 10 consumption cities in the US

Drawing a quick bar chart to see how are the consumption different in the top 10 cities for two type of farming system.

df7.3.2 <- avocado2 %>% 
  select(type, region_grouping, region, total_consumption) %>% 
  filter(region_grouping == "city") %>% 
  group_by(type, region) %>% 
  summarise(consumption = sum(total_consumption)) %>% 
  group_by(type) %>% 
  mutate(per = round(consumption/sum(consumption), 3)) %>% 
  mutate(type = fct_recode(type,
                           "Non-organic Avocado" = "conventional",
                           "Organic Avocado" = "organic"))

df7.3.2 <- df7.3.2 %>% 
  group_by(type) %>% 
  arrange(desc(consumption)) %>% 
  slice(1:10)


ggplot(df7.3.2, aes(x = reorder(region, -consumption), y = consumption, fill = type)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~type, scale = "free_x") +
  theme_bw() + 
  theme(legend.position = "none" ,
        strip.text = element_text(size = 16),
        plot.title = element_text(face = "bold", size = 20),
        axis.text.x = element_text(angle = 25, vjust = 0.65, size = 12),
        axis.title = element_text(size = 16),
        axis.title.y = element_text(margin = margin(0, 15, 0, 0))) +
  labs(x = "US City",
       y = "Avocado consumption, per million", 
       title = "Figure 5. Top 10 US cities Avocado Consumption by Bar Chart",
       subtitle = "Consumption of non-organic Avocado is way more than organic Avocado at even 25 times differet in GreatLakes",
       caption = "Total consumption From 2015-01-04 to 2018-03-25 (1176 days)") +
  scale_y_continuous(lim = c(0, 600000000), 
                     breaks = seq(0, 600000000, 50000000),  
                     labels = function(x){x/1000000}) +  
  geom_text(aes(label = paste0(round(consumption/1000000), " m")),  
            vjust = -0.6) +
  geom_vline(xintercept = "Houston", linetype = 2, alpha = 0.5, colour = "grey5")

  • This graph shows that top 10 avocado consumption cities in the US prefer conventionally farmed avocado, and the reason behind might be due to cheaper price, based on this data set.
  • In Houston, 98% of avocado consumers consumed non-organic avocado, and only 2% consume organic Avocado.
  • In Houston, the consumption of organic avocado is 50 times lesser than the consumption of non-organic avocado.

7.4 How are people buying their avocados in the US?

It will be interesting to see how are my fellow avocado lovers buying their fruits in the US, and especially in the Houston.

From the data set, there are variables categorising how avocados were sold, such as PLU4046 which stands for small or medium Hass avocado, PLU4225 stands for large Hass avocado, PLU4770 stands for extra large Hass avocado, and avocados that sold in small, large and xlarge bags.

Recall: Although the small, large and xlarge bags columns have their names suggest that the values within them have a unit of “per bag”, however the unit is actually per avocado, not in the unit of bag.- see section 5.

# df preparation and data manipulation for pivot_longer

df7.4 <-  avocado2 %>% 
  filter(region_grouping == "city") %>% 
  select(type, region, PLU4046, PLU4225, PLU4770, amount_from_small_bags, amount_from_large_bags, amount_from_xlarge_bags) %>% 
  pivot_longer(c("PLU4046", "PLU4225", "PLU4770", "amount_from_small_bags", "amount_from_large_bags", "amount_from_xlarge_bags"),
               names_to = "category",
               values_to = "consumption_per_day")


df7.4.2 <- df7.4 %>% 
  group_by(type, category) %>% 
  summarise(Total = sum(consumption_per_day)) %>% 
  mutate(category = factor(category,
                           levels = c("PLU4046", "PLU4225", "PLU4770", "amount_from_small_bags", "amount_from_large_bags", "amount_from_xlarge_bags")),
         category = fct_recode(category,
                               "large_bags" = "amount_from_large_bags",
                               "small_bags" = "amount_from_small_bags",
                               "Xlarge_bags" = "amount_from_xlarge_bags"),
         type = fct_recode(type, 
                           "Non-organic Avocado" = "conventional",
                           "Organic Avocado" = "organic")) %>% 
  group_by(type) %>% 
  mutate(per = paste0(round(Total/sum(Total)*100, 2), "%"))



ggplot(df7.4.2, aes(x = "", y = Total, fill = category)) +
  geom_bar(stat = "identity") + 
  coord_polar(theta = "y", start = 0) + 
  geom_text(aes(label = paste0(category, "\n",
                              prettyNum(Total, big.mark = ","), "\n",
                               per)), 
            position = position_stack(vjust = 0.5),
            colour = "black") + 
  theme_minimal() + 
  theme(legend.position = "none",
        axis.title = element_blank(),
        strip.text = element_text(size = 15),
        plot.title = element_text(size = 22, face = "bold", vjust = 1,
                                  hjust = 0.5),
        plot.subtitle = element_text(size = 14, hjust = 0.5),
        axis.text = element_blank()) +  
  facet_wrap(~type, switch = "x") +  
  labs(title = "Figure 6. Pie chart comparing the ways all avocados sold in the US",
       subtitle = "Total consumption From 2015-01-04 to 2018-03-25 (1176 days)")

US people loves buying PLU4046 (32.2%) and PLU4225 (36%) non-organic avocado, they stands for “small or medium” size avocado and “large size” avocado. Not many Americans like extra large size avocado (PLU4770) with only 3.47%. For packaging size, people in the US love buying avocados packed in small bags at 21.69%, but not really prefer large and extra large packages.

See the relevant statistics:

df7.4.2 %>% 
  filter(type == "Non-organic Avocado") %>% 
  arrange(desc(Total)) %>% 
  kbl(align = "c",
      caption = "Overall consumption of Non-organic Avocado in the US") %>% 
  kable_styling(bootstrap_options = c("hover", "bordered")) 
Overall consumption of Non-organic Avocado in the US
type category Total per
Non-organic Avocado PLU4225 1594976994 36.03%
Non-organic Avocado PLU4046 1425379408 32.2%
Non-organic Avocado small_bags 959950034 21.69%
Non-organic Avocado large_bags 272024659 6.15%
Non-organic Avocado PLU4770 153394014 3.47%
Non-organic Avocado Xlarge_bags 20943726 0.47%

Above pie chart shows how small the consumption of organic avocado is as compared to non-organic avocado. The pies in the pie chart of the organic avocado might be too small to view, following pie chart is an extension of that chart.

ggplot(df7.4.2 %>% filter(type == "Organic Avocado"), aes(x = "", y = Total, fill = category)) +
  geom_bar(stat = "identity") + 
  coord_polar(theta = "y", start = 0) + 
  geom_text(aes(label = paste0(category, "\n",
                              prettyNum(Total, big.mark = ","), "\n",
                               per)), 
            position = position_stack(vjust = 0.5),
            colour = "black") + 
  theme_minimal() + 
  theme(legend.position = "none",
        axis.text = element_blank(),
        axis.title = element_blank(),
        plot.title = element_text(face = "bold", size = 15),
        strip.text = element_text(size = 14),
        plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm"),
        plot.subtitle = element_text(hjust = 0.5)) +
  facet_wrap(~ type, switch = "x") +
   labs(title = "Figure 7. Pie chart comparing the ways Organic avocados sold in the US",
        subtitle = "Total consumption From 2015-01-04 to 2018-03-25 (1176 days)")

df7.4.2 %>% 
  filter(type == "Organic Avocado") %>% 
  group_by(type) %>% 
  arrange(desc(Total), .by_group = TRUE) %>%  
  kbl(align = "c",
      caption = "Overall consumption of Organic Avocado in the US") %>% 
  kable_styling(bootstrap_options = c("hover", "bordered"))
Overall consumption of Organic Avocado in the US
type category Total per
Organic Avocado PLU4225 49627505 36.55%
Organic Avocado small_bags 45156090 33.26%
Organic Avocado large_bags 22261124 16.4%
Organic Avocado PLU4046 17973594 13.24%
Organic Avocado PLU4770 754630 0.56%
Organic Avocado Xlarge_bags 3214 0%

In the organic avocado industry of the US, PLU4225 Avocado (small and medium size) is also the most popular avocados, extra large avocado size (PLU4770) and extra large bags are still the least preferred option, it is similar to the non-organic avocado industry. Organic consumers also love their avocado packed in small bags and large bags as compared to non-organic consumer.

7.5 How are people buying their avocado in Houston?

According to figure 4 in section 7.3, 98% of avocado consumers in Houston consumed non-organic avocado, and only 2% consume organic avocado, between 2015-01-04 to 2018-03-25. Among the 98% non-organic avocado consumers, about 50% of them prefer small and medium size avocado (PLU4046), about 24% of them prefer large medium size avocado (PLU4225), and followed by 15.2% preference of avocados packed in small bags.

df7.5 <- df7.4 %>% 
  filter(region == "Houston") %>% 
  group_by(type, category) %>% 
  summarise(Total = sum(consumption_per_day)) %>% 
  mutate(category = factor(category,
                           levels = c("PLU4046", "PLU4225", "PLU4770", "amount_from_small_bags", "amount_from_large_bags", "amount_from_xlarge_bags")),
         category = fct_recode(category,
                               "large_bags" = "amount_from_large_bags",
                               "small_bags" = "amount_from_small_bags",
                               "Xlarge_bags" = "amount_from_xlarge_bags"),
         type = fct_recode(type, 
                           "Non-organic Avocado" = "conventional",
                           "Organic Avocado" = "organic")) %>% 
  group_by(type) %>% 
  mutate(per = paste0(round(Total/sum(Total)*100, 2), "%"))

ggplot(df7.5, aes(x = "", y = Total, fill = category)) +
  geom_bar(stat = "identity") +
  coord_polar(theta = "y",
              start = 0) + 
  geom_text(aes(label = paste0(category, "\n",
                              prettyNum(Total, big.mark = ","), "\n",
                               per)), 
            position = position_stack(vjust = 0.5),
            colour = "black") + 
  facet_wrap(~ type, switch = "x") +
  theme_minimal() +
  scale_fill_brewer(palette = "OrRd") +
    theme(legend.position = "none",
        axis.title = element_blank(),
        strip.text = element_text(size = 20),
        plot.title = element_text(size = 22, face = "bold", vjust = 1,
                                  hjust = 0.5),
        plot.subtitle = element_text(size = 15, hjust = 0.5),
        axis.text = element_blank()) +
  labs(title = "Figure 8. Pie chart comparing the ways all avocados sold in the Houston",
       subtitle = "Total consumption From 2015-01-04 to 2018-03-25 (1176 days)")

df7.5 %>% 
  filter(type == "Non-organic Avocado") %>% 
  arrange(desc(Total)) %>% 
  kbl(align = "c",
      caption = "Non-organic Avocado consumption with selling methods in Houston") %>% 
  kable_styling(bootstrap_options = c("hover", "bordered"))
Non-organic Avocado consumption with selling methods in Houston
type category Total per
Non-organic Avocado PLU4046 98198811 49.3%
Non-organic Avocado PLU4225 47696206 23.94%
Non-organic Avocado small_bags 30272429 15.2%
Non-organic Avocado large_bags 17253877 8.66%
Non-organic Avocado PLU4770 5455463 2.74%
Non-organic Avocado Xlarge_bags 326131 0.16%

Whereas, among the 2% of organic consumers, they prefer their organic avocado packed in small bags (56.82%), then prefer small and medium size avocado (PLU4046), it was ranked the first preference for non-organic consumer. Again, extra large size avocado and extra large packing-bags are the least preferred option.

ggplot(df7.5 %>% filter(type == "Organic Avocado"), aes(x = "", y = Total, fill = category)) +
  geom_bar(stat = "identity") +
  coord_polar(theta = "y",
              start = 0) + 
  geom_text(aes(label = paste0(category, "\n",
                              prettyNum(Total, big.mark = ","), "\n",
                               per)), 
            position = position_stack(vjust = 0.5),
            colour = "black") + 
  facet_wrap(~ type) +
  theme_minimal() +
  scale_fill_brewer(palette = "OrRd") +
    theme(legend.position = "none",
        axis.title = element_blank(),
        strip.text = element_text(size = 20),
        plot.title = element_text(size = 22, face = "bold", vjust = 1,
                                  hjust = 0.5),
        plot.subtitle = element_text(size = 15, hjust = 0.5),
        axis.text = element_blank()) +
  facet_wrap(~ type, switch = "x") +
  labs(title = "Figure 9. Pie chart comparing the ways Organic avocados sold in the Houston",
       subtitle = "Total consumption From 2015-01-04 to 2018-03-25 (1176 days)")

df7.5 %>% 
  filter(type == "Organic Avocado") %>% 
  arrange(desc(Total)) %>% 
  kbl(align = "c",
      caption = "Organic Avocado consumption with selling methods in Houston") %>% 
  kable_styling(bootstrap_options = c("hover", "bordered"))
Organic Avocado consumption with selling methods in Houston
type category Total per
Organic Avocado small_bags 2252737 56.82%
Organic Avocado PLU4046 1574086 39.7%
Organic Avocado large_bags 109883 2.77%
Organic Avocado PLU4225 28229 0.71%
Organic Avocado PLU4770 9 0%
Organic Avocado Xlarge_bags 0 0%

8 Predicting Avocado price in Houston

This section I will first predict how are avocados prices in the US then followed by Houston. If predict prices in Houston exceed my budget, I will continue to search and predict other cities.

This section I will predict:

  • Non-organic avocado prices in the US 24 months after 2018-03-25
  • Organic avocado prices in the US 24 months after 2018-03-25
  • Non-organic avocado prices in the Houston 24 months after 2018-03-25
  • Organic avocado prices in the Houston 24 months after 2018-03-25

8.1 Forecast: Non-organic Avocado prices in US

This time plot shows the prices of non-organic avocado was relatively constant between 2015 and 2016, but followed by big fluctuations in 2016 and 2017. The graph seems not stationary and had a strong trend of increment of price per avocado over time though it dropped drastically in the end of 2017 and stabilised in the start of 2018. Seasonality is definitely present in the August, September, and October of 2016 and 2017.

conv_us <- avocado2 %>% 
  filter(region_grouping == "city",
         type == "conventional") %>% 
  group_by(date) %>% 
  summarise(average_price = mean(average_price))
  
# set up ts

conv_us_ts <- ts(conv_us$average_price,
                 start = c(2015, 1),
                 frequency = 52) 
# plot

autoplot(conv_us_ts) +
  labs(title = "Time plot: non-organic avocado weekly prices in the US",
       subtitle = "Averaged from all cities",
       y = "$") +
  geom_point(colour = "brown", shape = 21) +
  geom_path(colour = "brown")

The differencing plot below shows steep changes of prices near the beginning and end of 2017, suggesting there may be a seasonality.

conv_us_ts_d <- diff(conv_us_ts)
autoplot(conv_us_ts_d) + geom_point(colour = "brown", shape = 21) +
  labs(title = "Differencing plot: Change in weekly prices of non-organic avocado in the US",
       y = "$") 

Apart from visualising stationarity and seasonality, I can apply some statistical tests that give P-values to aid my observation:

  • Stationarity test has a P-value of 0.1563, rejecting the null hypothesis of the method and concludes a non-stationarity result. It matches my thought from the above graphs.
PP.test(conv_us_ts)
## 
##  Phillips-Perron Unit Root Test
## 
## data:  conv_us_ts
## Dickey-Fuller = -3.006, Truncation lag parameter = 4, p-value = 0.1563
# Null hypothesis: Not stationary
# Fail to reject, so the object is not stationary. 
  • Seasonality test rejects the null hypothesis of the test and conclude that the trend in the plot is seasonal. It again supports my thoughts that the dataset has seasonality.
summary(wo(conv_us_ts))
## Test used:  WO 
##  
## Test statistic:  1 
## P-value:  3.39749e-07 1.786835e-05 0.01781721 
##  
## The WO - test identifies seasonality

Looking at exponential smoothing models (ETS), I know this method cannot be used for frequency that is higher than 24. My frequency is 52 becasuse I am forecasting base on weekly data and there are 52 weeks for a year. Click the right button.

conv_us_ts_ETS <- ets(conv_us_ts)
## Warning in ets(conv_us_ts): I can't handle data with frequency greater than 24.
## Seasonality will be ignored. Try stlf() if you need seasonal forecasts.

I am now attempting to try out ARIMA models.

conv_us_ts_arima <- auto.arima(conv_us_ts,
                         d = 1, 
                         stepwise = F,
                         approximation = F)

Ljung box test reveal a p-value of 0.7584, so I safely say there is no autocorrelation from this model and suggesting it is safe to proceed to forcasting.

Box.test(conv_us_ts_arima$residuals, lag = 52, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  conv_us_ts_arima$residuals
## X-squared = 44.561, df = 52, p-value = 0.7584

I can also visualise the autocorrelation by looking at the ACF-log plot. There is a little bit of autocorrelation left when Lag = 10. It should be fine as it is minor and supported by the result of Ljung box test. Additionallu, it is good to see that I have a normally distributed residuals.

checkresiduals(conv_us_ts_arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,0)(1,0,0)[52]
## Q* = 24.408, df = 33, p-value = 0.8605
## 
## Model df: 1.   Total lags used: 34
conv_us_ts_arima_fc <- forecast(conv_us_ts_arima, h = 104)

autoplot(conv_us_ts_arima_fc) +
  geom_hline(yintercept = 2, linetype = 2, colour = "blue") +
  labs(subtitle = "Prediction of weekly prices of non-organic avocado prices in the US",
       y = "$") +
  geom_text(x = ymd(2018-01-01), y = 2.1, label = "hi")

  • This forecast has nearly 95.93% of accuracy (100 - MAPE) base on the given data.
summary(conv_us_ts)   # or accuracy(organic_AA)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.8296  1.0511  1.1209  1.1636  1.2738  1.7233
  • The darker blue area is the 95% confidence interval, and non-organic avocado prices fall within $2 within the next two years after 2018.

8.2 Forecast: Organic Avocado prices in US

Applying the same concepts as in the previous section.

org_us <- avocado2 %>% 
  filter(region_grouping == "city",
         type == "organic") %>% 
  group_by(date) %>% 
  summarise(average_price = mean(average_price))
  
# set up ts

org_us_ts <- ts(org_us$average_price,
                 start = c(2015, 1),
                 frequency = 52) 
# plot

autoplot(org_us_ts) +
  labs(title = "Time plot: organic avocado weekly prices in the US",
       subtitle = "Averaged from all cities",
       y = "$") +
  geom_point(colour = "darkgreen", shape = 21) +
  geom_path(colour = "darkgreen")

  • It is not stationary base on PP-test.
PP.test(org_us_ts)
## 
##  Phillips-Perron Unit Root Test
## 
## data:  org_us_ts
## Dickey-Fuller = -2.5455, Truncation lag parameter = 4, p-value = 0.3486
# Not stationary
  • It is not seasonal, though there are 2 peaks in 2016 and 2017. The data set needs more years to prove it is seasonal.
summary(wo(org_us_ts))
## Test used:  WO 
##  
## Test statistic:  0 
## P-value:  1 1 0.7244975 
##  
## The WO - test does not identify  seasonality

I will personally reject to follow this P-value as proved by my another plot (subseries plot) below, it shows possibility of seasonality. It might be masked by noises.

ggsubseriesplot(org_us_ts) +
  geom_point(colour = "darkgreen", fill = "green", shape = 21, size = 3) +
  labs(title = "Subseries plot: The changes of Yearly prices in each month",
       y = "$",
       subtitle = "Between week 34 to 43, most years seems to have slightly increase in prices and is suggesting a possible seasonality.") +
  annotate("rect", xmin = 34, xmax = 43, ymin = -Inf, ymax = Inf, fill = "yellow", alpha = 0.1)

Hopefully, auto ARIMA should pick up the trend and seasonality and give a suitable model without autocorrelation.

org_us_ts_arima <- auto.arima(org_us_ts, 
                              d = 1,
                              stepwise = F,
                              approximation = F)

There are autocorrelation exists in the ACF-Lag plots of the chosen ARIMA model and suggesting there be data remain unexplained if forecast is to be proceeded. LJung-box test also supporting this result with P-value less than 0.05.

checkresiduals(org_us_ts_arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,3)(0,1,0)[52]
## Q* = 100.59, df = 30, p-value = 1.497e-09
## 
## Model df: 4.   Total lags used: 34

Declaring in the forecast of the existence of autocorrelation during ACF-Lag test.

org_us_ts_arima_fc <- forecast(org_us_ts_arima, h = 104)
autoplot(org_us_ts_arima_fc) +
  labs(subtitle = "Prediction of weekly prices of organic avocado in the US",
       caption = "This is the best model of ARIMA but accuracy affected by autocorrelation.",
       y = "$") +
  geom_hline(yintercept = 2, linetype = 2, colour = "blue")

8.3 Forecast: Non-organic Avocado prices in Houston

Applying the same concept as previous sections. Plotting the time plot first to observe trend stationary and seasonality. From observation, there are a lot of noises in the graph making seasonality vague, though there are a couple of obvious peaks between 2016 to 2017, and 2017 to 2018. There seems like a gradual positive trend though the price drops back to really low of near $0.55 near the end of the data trend.

conv_houston <- avocado2 %>% 
  filter(region == "Houston",
         type == "conventional") %>% 
  group_by(date) %>% 
  summarise(average_price = mean(average_price))
  
# set up ts   

conv_houston_ts <- ts(conv_houston$average_price,
                 start = c(2015, 1),
                 frequency = 52) 
# plot

autoplot(conv_houston_ts) +
  labs(title = "Time plot: non-organic avocado weekly prices in Houston",
       y = "$") +
  geom_point(colour = "brown", shape = 21) +
  geom_path(colour = "brown")

I have a p-value from stationary test showing a result of higher than 0.05, and the test says the graph is not stationary.

PP.test(conv_houston_ts)
## 
##  Phillips-Perron Unit Root Test
## 
## data:  conv_houston_ts
## Dickey-Fuller = -3.2503, Truncation lag parameter = 4, p-value =
## 0.08177

There are too much noises in the graph, and the seasonality tests does not identify seasonality in the graph.

summary(wo(conv_houston_ts))
## Test used:  WO 
##  
## Test statistic:  0 
## P-value:  1 1 0.05514139 
##  
## The WO - test does not identify  seasonality

Applying the auto_arima functions in R to choose the best model.

conv_houston_ts_arima <- auto.arima(conv_houston_ts,
                                    d = 1,
                                    approximation = F,
                                    stepwise = F,
                                    trace = T)

Testing the model for autocorrelation existence. The ACF-Lag graph shows that the model can explain most of the data in the dataset, only left with just a few, minor autocorrelation. This suggesting I am safe to proceed with forecasting.

checkresiduals(conv_houston_ts_arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,0)(0,0,1)[52]
## Q* = 31.227, df = 33, p-value = 0.5556
## 
## Model df: 1.   Total lags used: 34

Creating the forecast of non-organic avocado in the Houston for next 2 years.

conv_houston_ts_arima_fc <- forecast(conv_houston_ts_arima, h = 104)

autoplot(conv_houston_ts_arima_fc) + labs(subtitle = "Prediction of weekly prices of non-organic avocado in Houston",
       y = "$") +
  geom_hline(yintercept = 2, linetype = 2, colour = "blue")

The accuracy of this forecast is 92.97% (1 - MAPE)

summary(conv_houston_ts_arima)
## Series: conv_houston_ts 
## ARIMA(0,1,0)(0,0,1)[52] 
## 
## Coefficients:
##         sma1
##       0.1864
## s.e.  0.0965
## 
## sigma^2 estimated as 0.006018:  log likelihood=190.7
## AIC=-377.39   AICc=-377.32   BIC=-371.15
## 
## Training set error measures:
##                        ME       RMSE        MAE        MPE     MAPE      MASE
## Training set -0.001014761 0.07711343 0.05503131 -0.6660946 7.025728 0.4368157
##                     ACF1
## Training set -0.04668654
  • The plot shows the 95% confidence interval (darker blue area) in the forecast zone does not exceed $2 for the next 2 years after 2018.
  • It is a good sign. My threshold is to hope for avocado price not to exceed $2 per unit. So that I can consume minimum 1 per day (set in the scenario) and meeting the budget of 800 dollars per year.

8.4 Forecast: Organic Avocado prices in Houston

Hopefully organic avocado has a good range of prices in next two years.

Time plot showing the overall weekly changes of avocado price from 2015-01-04 to 2018-03-25.

# set up df for this section

org_houston <- avocado2 %>% 
  filter(type == "organic",
         region == "Houston") %>% 
  arrange(date)

# set up ts df

org_houston_ts <- ts(org_houston$average_price,
                 frequency = 52,      # My data is weekly. 
                 start = c(2015, 1))                      

#Plot

autoplot(org_houston_ts) +
  labs(title = "Time plot: organic avocado weekly prices in Houston",
       caption = "Total consumption From 2015-01-04 to 2018-03-25 (1176 days)",
       y = "Price per Avocado ($)") +
  geom_point(colour = "darkgreen", shape = 21) +
  geom_path(colour = "darkgreen")

From the first year of our data, organic price of avocado in the first 6 months of 2015 was relatively constant, however, the price fluctuated drastically in the next 2 to 3 years.

Stationary test fails to reject the null hypothesis (P-value = 0.2311) and says the trend is not stationary.

PP.test(org_houston_ts)
## 
##  Phillips-Perron Unit Root Test
## 
## data:  org_houston_ts
## Dickey-Fuller = -2.8269, Truncation lag parameter = 4, p-value = 0.2311

There is no indication of seasonality from its P-value.

summary(wo(org_houston_ts))
## Test used:  WO 
##  
## Test statistic:  0 
## P-value:  1 1 0.7900466 
##  
## The WO - test does not identify  seasonality

Exponential Smoothing models (ETS)

I am apply ETS modeling using following codes. However, this model can’t handle frequency greater than 24, I have a frequency of 52 because my data is weekly data. I will then choose another model for my forecasting.

org_houston_ts_ets <- ets(org_houston_ts)
## Warning in ets(org_houston_ts): I can't handle data with frequency greater than
## 24. Seasonality will be ignored. Try stlf() if you need seasonal forecasts.

ARIMA Model

Applying ARIMA model for my forecast.

org_houston_ts_arima <- auto.arima(org_houston_ts,
                               d = 1, 
                               stepwise = F,
                               approximation = F,
                               trace = T)

Residuals-check shows that ACF-Lag has not much autocorrelation left, the model can explain most of the data and is using the data quite well. There is a normal distribution of residuals as well.

checkresiduals(org_houston_ts_arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,2)
## Q* = 31.381, df = 32, p-value = 0.4977
## 
## Model df: 2.   Total lags used: 34

Testing Ljung-Box test, p-value is greater than 0.05, then I safely say no autocorrelation is present. I can safely use this model for the purpose of forecasting.

Box.test(org_houston_ts_arima$residuals, lag = 52, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  org_houston_ts_arima$residuals
## X-squared = 42.259, df = 52, p-value = 0.8304

Synthesising the forecast graph.

org_houston_ts_arima_fc <- forecast(org_houston_ts_arima)

autoplot(org_houston_ts_arima_fc, h = 104) +
  labs(subtitle = "Prediction of weekly prices of non-organic avocado in Houston",
       y = "$") +
  geom_vline(xintercept = 2019, linetype = 2, colour = "red") +
  geom_hline(yintercept = 2, linetype = 2, colour = "blue") +
  geom_text(x = 2.5, y = 2019)

This model has nearly 95% (94.9515) of accuracy (100 - MAPE).

summary(org_houston_ts_arima)   # or accuracy(organic_AA)
## Series: org_houston_ts 
## ARIMA(0,1,2) 
## 
## Coefficients:
##          ma1      ma2
##       0.1212  -0.2221
## s.e.  0.0789   0.0830
## 
## sigma^2 estimated as 0.01007:  log likelihood=148.8
## AIC=-291.61   AICc=-291.46   BIC=-282.24
## 
## Training set error measures:
##                        ME       RMSE       MAE       MPE     MAPE      MASE
## Training set 0.0007205485 0.09945763 0.0629353 -0.301139 5.048496 0.1932152
##                    ACF1
## Training set 0.01370076

Bad news is that after 2019, price per organic avocado may exceed $2. And I might switch to non-organic avocado.

9 CONCLUSION

  • Phoenix Tuscon, Houston, West Tex New Mexico, Dallas, and Los Angeles are the top 5 cities for cheap non-organic avocados, with maximum price among them being $1.8 across the period given in the data set.
  • Houston, Dallas, Denver, Cincinnati Dayton, Roanoke, and West Tex New Mexico are the top 5 cities for cheap organic avocados, with maximum price among them being $2.27.
  • 97% of avocados in the US were sold as non-organic avocado, and organic avocado consumption was only 3%.
  • In US, Great Lakes, Los Angeles and Plains are the top 3 cities consumed non-organic avocado the most at 12.8%, 11.2%, 6.8%.
  • In US, Great Lakes, Los Angeles, New York are the top 3 cities consumed organic avocado the most at 16.2%, 9.9%, and 6.6%.
  • Most popular non-organic avocado selling method in the US is to sell avocados in small and medium size (PLU4225) with popularity at 36%, followed by large size avocado (PLU4046) 32.2%, and small package (21.69%). Large package, extra large avocado (PLU4770), and extra large package are the least prefer selling method.
  • Organic avocado consumers also prefer small and medium size avocado (PLU4225), at 36.55%, then followed by small package at 33.26%, large bags at 16.4%, then PLU4046 and PLU4770 at 13.24% and 0.56% respectively. No extra large packages were sold.
  • Picked Houston as the city of first preference athe city is the second cheapest non-organic avocado city, the cheapest organic avocado city, and is one of the top 10 most avocado-consuming cities.
  • From my forecast, 95% of non-organic avocado prices would fall below my threshold of 2 dollars per unit avocado in the upcoming 24 months after 2018 March 25. However, for organic avocado, 95% of chance that their prices may exceed 2 dollars seasonally.
  • From my forecast, 95% of non-organic avocado prices in Houston would fall below my threshold of $2 per unit, but for organic avocado, 95% of chance that the price may exceed 2 dollars after 2019.

The best city for me to move to is Houston, I am buying organic avocados until the end of 2019, then switch to non-organic ones. These conditions fulfilled my desire of at least 1 avocado per day, not exceeding threshold of $800 per year, and in the same time being environmental friendlier.

10 LEGALITY

This is a personal project created and designed for non-commercial use only. The data set of this personal project was introduced by Google data analytic certificate course and was made available for students for some analytics practices in the course. This is a public data set from Kaggle. Please visit Topic 4 in this project for the link. This project extends the analysis scope of the course. This personal project was not submitted and graded by Google. Google and I, nor any stakeholders relevant to the creation of this project held any responsibility for any outcomes affected from using the results of this project. This project is only for educational and skill demonstration purposes.

All photos in this project, such as those as the thumbnail are just for demonstration only, they have nothing to do with the data set, the location where the data were collected, and the results of this analysis.

11 REFERENCE

the Hass Avocado Board n.d., How to Identify Hass Avocadoes, viewed 18 June 2021, 2021,https://loveonetoday.com/how-to/identify-hass-avocados/

Overview of Avocado Datasets n.d., viewed 18 June 2021, https://cran.r-project.org/web/packages/avocado/vignettes/a_intro.html

Russell M 2020, Avocado New Zealand avocado returns high despite COVID-19 affected season, viewed 18 June 2021, https://www.freshplaza.com/article/9233181/new-zealand-avocado-returns-high-despite-covid-19-affected-season/