Photo by Russell 2020
Reading time: 29 minutes
This is a personal side project using a public avocado data set introduced in the 6-month Google Data Analytics Professional Certificate program. The goal of this project is to identify which US city has the cheapest avocados and whether the price fits within a budget of $800 a year at a consumption rate of at least 1 avocado per day. Another goal is to predict the trend of avocado prices, organic and non-organic, over the next 24 months, in the hope that prices in the chosen city remain budget-friendly.
Numerous graphs were created during EDA and prediction, including box plots, tree maps, bar charts, pie charts, different types of time plot (general, differenced, and subseries), ACF-lag plots, histograms for checking normality, and forecasting graphs. Forecasting touches on stationarity and seasonality testing using the Phillips-Perron (PP) test and the Webel-Ollech (WO) test, and autocorrelation testing using the Ljung-Box test. There are 4 forecasting activities, predicting non-organic and organic avocado prices in the US as well as in the chosen city over the next 24 months.
Results reveal that 97% of avocados in the US were sold as non-organic and 3% as organic. The most popular avocados are the large Hass size (PLU4225). I pick Houston as my first-preference city to move to: it is the second-cheapest city for non-organic avocados, the cheapest city for organic avocados, and among the top 10 most avocado-consuming cities. From my forecast, 95% of non-organic avocado prices in Houston will fall below my threshold of 2 dollars per avocado, but for organic avocados there is a 95% chance the price may exceed 2 dollars after 2019. Based on this data set and these results, I will move to Houston, consume organic avocados until the end of 2019, then switch to non-organic ones. These conditions fulfil my desire of at least 1 avocado per day without exceeding the threshold of $800 per year, while at the same time being more environmentally friendly.
R packages loaded in this project include the tidyverse packages (ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats), skimr, lubridate, kableExtra, ggrepel, hms, treemapify, fpp2, and seastests.
library(tidyverse)
library(skimr)
library(lubridate)
library(kableExtra)
library(ggrepel)
library(hms)
library(treemapify)
library(fpp2)
library(seastests)
In this scenario: (1) I am an avocado lover, (2) I must consume a minimum of 1 avocado per day, and (3) I am migrating to the US next year to experience US life.
I would like to move to a US city with cheap avocados. Although I would like to eat as many as possible, I have a budget of $800 per year allocated for avocado consumption.
In my plan, I would like to find out the top 5 cities where avocados are the cheapest, and hopefully the cheap prices remain similar in the next 2 years after I move there. It would also be great if I could contribute something to the environment by eating organic avocados, provided, of course, that the organic price is within my budget.
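For reference, the budget implies a ceiling of roughly $2.19 per avocado at 1 avocado per day, which is why a $2 line later appears in the forecast plots as a comfortable threshold:
# Implied per-avocado budget ceiling at 1 avocado per day
800 / 365
## [1] 2.191781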
I hope this project can give me some insightful answers!
The data set used in this personal project is officially known as “Avocado Prices”. The data set was introduced by the 6-month Google Data Analytics Professional Certificate program, which I completed on 31 May 2021. The program used this data set in a few of its data analysis activities.
As instructed by the Google program, the data set was downloaded from Kaggle, a great website for the data science community; click this link to visit the data set’s page. The data set was created and uploaded to Kaggle by Justin Kiggins, and the data originally came from the Hass Avocado Board website in May 2018; click here to visit the Hass Avocado Board website.
This section imports the “Avocado Prices” data set into R under the name “avocado”. The following table shows the first 10,000 rows of raw data within the data set.
avocado <- read_csv("avocado.csv")
avocado
The data set contains important information such as dates, average avocado prices, amounts sold, 3 PLU codes of avocados, the type of produce, and the associated region. Check out the following table for these details. The information in the description column is from Kaggle.
Variable <- c("X1",
"Date",
"AveragePrice",
"Total Volume",
"4046",
"4225",
"4770",
"Total Bags",
"Small Bags",
"Large Bags",
"XLarge Bags",
"type",
"year",
"region")
Description <- c("Index",
"The date of the observation",
"The average price of a single avocado, even when multiple avocados are sold in bags",
"Total number of avocados sold",
"Total number of avocados sold by 4046, it is a Product Lookup (PLU) code",
"Total number of avocados sold by 4225, it is a Product Lookup (PLU) code",
"Total number of avocados sold by 4770, it is a Product Lookup (PLU) code",
"Missing information (Kar: The unit is actually per avocado instead of per bag- see section 5)",
"Missing information (Kar: The unit is actually per avocado, see section 5)",
"Missing information (Kar: It should be the total number of avocado sold, not number of bags, see section 5)",
"Missing information (Kar: It should be the total number of avocado sold, not number of bags, see section 5)",
"conventional or organic",
"the year",
"the city or region of the observation")
data.frame(Variable, Description) %>%
kbl(align = "l",
table.attr = "style = 'width: 60%;'") %>%
kable_styling(bootstrap_options = c("hover",
"bordered",
"striped"))
| Variable | Description |
|---|---|
| X1 | Index |
| Date | The date of the observation |
| AveragePrice | The average price of a single avocado, even when multiple avocados are sold in bags |
| Total Volume | Total number of avocados sold |
| 4046 | Total number of avocados sold by 4046, it is a Product Lookup (PLU) code |
| 4225 | Total number of avocados sold by 4225, it is a Product Lookup (PLU) code |
| 4770 | Total number of avocados sold by 4770, it is a Product Lookup (PLU) code |
| Total Bags | Missing information (Kar: The unit is actually per avocado instead of per bag- see section 5) |
| Small Bags | Missing information (Kar: The unit is actually per avocado, see section 5) |
| Large Bags | Missing information (Kar: It should be the total number of avocado sold, not number of bags, see section 5) |
| XLarge Bags | Missing information (Kar: It should be the total number of avocado sold, not number of bags, see section 5) |
| type | conventional or organic |
| year | the year |
| region | the city or region of the observation |
According to the Hass Avocado Board website, the 3 above-mentioned Product Lookup (PLU) codes are:
PLU <- c("4046", "4225", "4770")
Category <- c("Small/Medium Hass Avocado (~3-5oz avocado)",
"Large Hass Avocado (~8-10oz avocado)",
"Extra Large Hass Avocado (~10-15oz avocado)")
data.frame(PLU, Category) %>%
kbl(align = "l",
table.attr = "style = 'width: 40%;'") %>%
kable_classic(c("hover", "condensed"))
| PLU | Category |
|---|---|
| 4046 | Small/Medium Hass Avocado (~3-5oz avocado) |
| 4225 | Large Hass Avocado (~8-10oz avocado) |
| 4770 | Extra Large Hass Avocado (~10-15oz avocado) |
This data set has 18,249 rows of data (also known as observations) and 14 variables. It has 2 character columns, 1 date column, and 11 numeric columns.
skim_without_charts(avocado)
| Name | avocado |
| Number of rows | 18249 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| Date | 1 |
| numeric | 11 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| type | 0 | 1 | 7 | 12 | 0 | 2 | 0 |
| region | 0 | 1 | 4 | 19 | 0 | 54 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| Date | 0 | 1 | 2015-01-04 | 2018-03-25 | 2016-08-14 | 169 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| X1 | 0 | 1 | 24.23 | 15.48 | 0.00 | 10.00 | 24.00 | 38.00 | 52.00 |
| AveragePrice | 0 | 1 | 1.41 | 0.40 | 0.44 | 1.10 | 1.37 | 1.66 | 3.25 |
| Total Volume | 0 | 1 | 850644.01 | 3453545.36 | 84.56 | 10838.58 | 107376.76 | 432962.29 | 62505646.52 |
| 4046 | 0 | 1 | 293008.42 | 1264989.08 | 0.00 | 854.07 | 8645.30 | 111020.20 | 22743616.17 |
| 4225 | 0 | 1 | 295154.57 | 1204120.40 | 0.00 | 3008.78 | 29061.02 | 150206.86 | 20470572.61 |
| 4770 | 0 | 1 | 22839.74 | 107464.07 | 0.00 | 0.00 | 184.99 | 6243.42 | 2546439.11 |
| Total Bags | 0 | 1 | 239639.20 | 986242.40 | 0.00 | 5088.64 | 39743.83 | 110783.37 | 19373134.37 |
| Small Bags | 0 | 1 | 182194.69 | 746178.51 | 0.00 | 2849.42 | 26362.82 | 83337.67 | 13384586.80 |
| Large Bags | 0 | 1 | 54338.09 | 243965.96 | 0.00 | 127.47 | 2647.71 | 22029.25 | 5719096.61 |
| XLarge Bags | 0 | 1 | 3106.43 | 17692.89 | 0.00 | 0.00 | 0.00 | 132.50 | 551693.65 |
| year | 0 | 1 | 2016.15 | 0.94 | 2015.00 | 2015.00 | 2016.00 | 2017.00 | 2018.00 |
The data set is quite complete, without any missing values: the n_missing and complete_rate columns indicate that all variables have 0 missing values and a complete rate of 1 (100%).
Examining the whitespace column shows there are no white spaces to trim either.
The middle section, Variable type: Date, shows that this data set was recorded from 2015-01-04 to 2018-03-25.
Some important data problems have been identified.
Showing the first 3 rows of these columns:
avocado[1:3, 4:11]
Click this link to the relevant discussion page:
Try the calculation on the first row above:
4046 + 4225 + 4770 + Small Bags + Large Bags + XLarge Bags (notice: Total Bags is not included)
= 1036.74 + 54454.85 + 48.16 + 8603.62 + 93.25 + 0
1036.74 + 54454.85 + 48.16 + 8603.62 + 93.25 + 0
## [1] 64236.62
And it matches the value in the first row of Total Volume: 64236.62.
This should not be the case, because the bag columns should be in units of bags, and they would need to be converted into the actual number of avocados sold before taking part in that addition. Quite strangely, I was not able to find information on how many avocados are in each bag size category, so I cannot do the conversion myself.
So I believe the values in the Total, Small, Large, and XLarge Bags columns are already the actual number of avocados sold, rather than numbers of bags. The data supplier may have done the conversion for us.
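To back this up, the sketch below checks whether that identity holds across all rows of the raw data, not just the first (allowing for floating-point rounding):
# Sketch: per-row check that Total Volume equals the sum of the PLU and bag columns
avocado %>%
mutate(check = `4046` + `4225` + `4770` +
`Small Bags` + `Large Bags` + `XLarge Bags`) %>%
summarise(max_abs_diff = max(abs(`Total Volume` - check)))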
These problems can be easily solved by renaming the bag columns so their names reflect the actual unit (the number of avocados), and by removing the faulty “total_volume” and “total_bags” columns.
These activities will be performed in the data cleaning section.
I will examine the context of the data set on the following aspects: whether it is reliable, original, comprehensive, current, and cited (the ROCCC framework from the Google program).
In short, this data is moderately reliable, not original, less comprehensive, cited, and not current. These limitations affect the reliability of the results of this analysis.
This is a standard naming convention in the data industry to boost code readability and to avoid potential system errors caused by spaces in strings (character data) when using a programming language for analysis. For example, converting “Total Volume” into “total_volume” is the standard approach.
# The data set is copied to avocado2 and worked on from here; I am preserving the original data set.
avocado2 <- avocado
# to lower case
names(avocado2) <- tolower(names(avocado2))
# to snake case
avocado2 <- avocado2 %>%
rename("average_price" = "averageprice",
"total_volume" = "total volume",
"total_bags" = "total bags",
"amount_from_small_bags" = "small bags",
"amount_from_large_bags" = "large bags",
"amount_from_xlarge_bags" = "xlarge bags")
# check out the column names
names(avocado2)
I use this code to convert “averageprice”, “total volume”, “total bags”, “small bags”, “large bags”, and “xlarge bags” into the lower snake case names “average_price”, “total_volume”, “total_bags”, “amount_from_small_bags”, “amount_from_large_bags”, and “amount_from_xlarge_bags”.
Checking these conversions using the first 3 rows:
avocado2[1:3,]
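As an aside, the janitor package offers a one-line alternative that produces similar snake_case names automatically; this is just a sketch (janitor is an assumption here and is not otherwise used in this project):
# Sketch: automatic snake_case cleaning with the janitor package
# (assumption: janitor is installed; it is not otherwise loaded in this project)
library(janitor)
avocado %>%
clean_names() %>%
names()
One caveat with the automatic approach: purely numeric names such as “4046” become “x4046”, rather than the more descriptive “PLU4046” names created in the next step.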
Renaming the columns “4046”, “4225”, and “4770” also reduces potential errors during data manipulation: purely numeric names may be recognised as numbers during analysis instead of as column names.
The following code completes the conversion.
avocado2 <- avocado2 %>%
rename("PLU4046" = "4046",
"PLU4225" = "4225",
"PLU4770" = "4770")
Removing “x1”, “total_volume”, and “total_bags”.
Recall that I am removing “total_volume” and “total_bags” because the values within these columns are faulty: they should be integer counts, but floating-point numbers were introduced (see section 5.1).
Using the following code:
avocado2 <- avocado2 %>%
select(-x1, -total_volume, -total_bags)
Now I have them removed. The number of columns has been reduced from 14 to 11.
avocado2[1:3,]
This section removes the erroneous decimal places in “PLU4046”, “PLU4225”, “PLU4770”, “amount_from_small_bags”, “amount_from_large_bags”, and “amount_from_xlarge_bags” by rounding to the nearest integer.
avocado2 <- avocado2 %>%
mutate(PLU4046 = round(PLU4046),
PLU4225 = round(PLU4225),
PLU4770 = round(PLU4770),
amount_from_small_bags = round(amount_from_small_bags),
amount_from_large_bags = round(amount_from_large_bags),
amount_from_xlarge_bags = round(amount_from_xlarge_bags))
Converting the variables “type”, “year”, and “region” from their current data types (denoted “chr”, “dbl”, and “chr”) into factors (denoted “fctr”), because they are categorical: they can be used to group the data during analysis and be ordered into a particular sequence of preference.
More importantly, their levels (the elements within them) can be extracted with code and inspected for errors such as misspellings.
Code to complete the conversion:
avocado2 <- avocado2 %>%
mutate_if(is.character, factor) %>%
mutate(year = factor(year))
Checking whether there are errors within these variables:
avocado2 %>%
select(type, year, region) %>%
sapply(levels)
## $type
## [1] "conventional" "organic"
##
## $year
## [1] "2015" "2016" "2017" "2018"
##
## $region
## [1] "Albany" "Atlanta" "BaltimoreWashington"
## [4] "Boise" "Boston" "BuffaloRochester"
## [7] "California" "Charlotte" "Chicago"
## [10] "CincinnatiDayton" "Columbus" "DallasFtWorth"
## [13] "Denver" "Detroit" "GrandRapids"
## [16] "GreatLakes" "HarrisburgScranton" "HartfordSpringfield"
## [19] "Houston" "Indianapolis" "Jacksonville"
## [22] "LasVegas" "LosAngeles" "Louisville"
## [25] "MiamiFtLauderdale" "Midsouth" "Nashville"
## [28] "NewOrleansMobile" "NewYork" "Northeast"
## [31] "NorthernNewEngland" "Orlando" "Philadelphia"
## [34] "PhoenixTucson" "Pittsburgh" "Plains"
## [37] "Portland" "RaleighGreensboro" "RichmondNorfolk"
## [40] "Roanoke" "Sacramento" "SanDiego"
## [43] "SanFrancisco" "Seattle" "SouthCarolina"
## [46] "SouthCentral" "Southeast" "Spokane"
## [49] "StLouis" "Syracuse" "Tampa"
## [52] "TotalUS" "West" "WestTexNewMexico"
During inspection I found no misspellings, stray spaces, or typos, so I am good to proceed.
The “region” column is actually a faulty mix of cities, states, and generic locations such as “West” and “Northeast”. It is quite a mess, but that is what I am here for. I will create a new column, located before the “region” column, that groups cities, states, and generic locations (or something else if necessary).
avocado2 <- avocado2 %>%
mutate(region_grouping = fct_collapse(region,
"state" = c("California", "SouthCarolina"),
"city" = c("Albany", "Atlanta", "BaltimoreWashington", "Boise", "Boston", "BuffaloRochester", "Charlotte", "Chicago", "CincinnatiDayton", "Columbus", "DallasFtWorth", "Denver", "Detroit", "GrandRapids", "GreatLakes", "HarrisburgScranton", "HartfordSpringfield", "Houston", "Indianapolis", "Jacksonville", "LasVegas", "LosAngeles", "Louisville", "MiamiFtLauderdale", "Nashville", "NewOrleansMobile", "NewYork", "Orlando", "Philadelphia", "PhoenixTucson", "Pittsburgh", "Plains", "Portland", "RaleighGreensboro", "RichmondNorfolk", "Roanoke", "Sacramento", "SanDiego", "SanFrancisco", "Seattle", "Spokane", "StLouis", "Syracuse", "Tampa", "WestTexNewMexico"),
"regional" = c("Midsouth", "Northeast", "NorthernNewEngland", "SouthCentral", "Southeast", "West"),
"Total" = c("TotalUS")
)) %>%
relocate(region_grouping, .before = region)
Please note: “BaltimoreWashington” is actually Baltimore in Maryland, not Washington; it might be a typo.
avocado2 %>%
select(region_grouping, region) %>%
group_by(region_grouping) %>%
summarise("region" = paste(unique(region), collapse = ", ")) %>%
kbl(caption = "Doing my best to group US cities, states, and regional categories based on information searched on Google Maps (fingers crossed)") %>%
kable_styling(bootstrap_options = c("bordered", "hover"))
| region_grouping | region |
|---|---|
| city | Albany, Atlanta, BaltimoreWashington, Boise, Boston, BuffaloRochester, Charlotte, Chicago, CincinnatiDayton, Columbus, DallasFtWorth, Denver, Detroit, GrandRapids, GreatLakes, HarrisburgScranton, HartfordSpringfield, Houston, Indianapolis, Jacksonville, LasVegas, LosAngeles, Louisville, MiamiFtLauderdale, Nashville, NewOrleansMobile, NewYork, Orlando, Philadelphia, PhoenixTucson, Pittsburgh, Plains, Portland, RaleighGreensboro, RichmondNorfolk, Roanoke, Sacramento, SanDiego, SanFrancisco, Seattle, Spokane, StLouis, Syracuse, Tampa, WestTexNewMexico |
| state | California, SouthCarolina |
| regional | Midsouth, Northeast, NorthernNewEngland, SouthCentral, Southeast, West |
| Total | TotalUS |
Calculating the total consumption by adding PLU4046, PLU4225, PLU4770, amount_from_small_bags, amount_from_large_bags, and amount_from_xlarge_bags.
avocado2 <- avocado2 %>%
mutate(total_consumption = PLU4046 + PLU4225 + PLU4770 + amount_from_small_bags + amount_from_large_bags + amount_from_xlarge_bags)
The following graph shows the average prices of conventionally farmed (non-organic) avocados in each city.
avocado2 %>%
filter(region_grouping == "city", type == "conventional") %>%
select(date, region, average_price) %>%
ggplot(aes(x = reorder(region, -average_price, na.rm = T), y = average_price)) +
geom_jitter(aes(colour = region), alpha = 0.5) +
geom_violin(alpha = 0.5, size = 1) +
geom_hline(yintercept = 1.5, linetype = 2) +
geom_hline(yintercept = 1, linetype = 2) +
annotate("rect", xmin = "LosAngeles", xmax = "PhoenixTucson", ymin = -Inf, ymax = Inf, alpha = 0.2) +
geom_text(x = "WestTexNewMexico", y = 2.5, label = "My top 5 cities!", hjust = 0.5) +
stat_summary(fun = "mean") +
labs(x = "US City",
y = "Avocado prices",
caption = "From 2015-01-04 to 2018-03-25 (1176 days)",
title = "Figure 1. Violin plot of Non-organic Avocado Prices",
subtitle = "Visual aids: (1) Black dots are average price of individual avocado by city between Jan-2015 to Mar-2018, (2) the plot has been ordered descendingly, (3) Body of violin become fatter when data points increase.") +
theme_classic() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 25, vjust = 0.65),
plot.title = element_text(face = "bold", size = 15)) +
scale_y_continuous(lim = c(0, 3), breaks = seq(0, 3, 0.5))
My top 5 cities of preference for non-organic avocado are:
avocado2 %>%
filter(region_grouping == "city", type == "conventional") %>%
select(date, region, average_price) %>%
group_by(region) %>%
summarise(Average_price_per_avocado = paste0("$ ", round(mean(average_price), 2)),
Standard.deviation = paste0("$ ", round(sd(average_price), 2)),
max = paste0("$ ", round(max(average_price), 2)),
min = paste0("$ ", round(min(average_price), 2)),
median = paste0("$ ", round(median(average_price), 2))) %>%
arrange(Average_price_per_avocado) %>% # string sort; works here since all prices share the "$ x.xx" format
slice(1:5) %>%
mutate(No. = row_number()) %>%
relocate(No., .before = region) %>%
rename(city = region) %>%
kbl(align = "c",
table.attr = "style = 'width: 50%;'") %>%
kable_classic_2("hover")
| No. | city | Average_price_per_avocado | Standard.deviation | max | min | median |
|---|---|---|---|---|---|---|
| 1 | PhoenixTucson | $ 0.73 | $ 0.18 | $ 1.31 | $ 0.46 | $ 0.68 |
| 2 | Houston | $ 0.83 | $ 0.15 | $ 1.29 | $ 0.51 | $ 0.79 |
| 3 | WestTexNewMexico | $ 0.84 | $ 0.13 | $ 1.22 | $ 0.52 | $ 0.82 |
| 4 | DallasFtWorth | $ 0.85 | $ 0.14 | $ 1.3 | $ 0.65 | $ 0.82 |
| 5 | LosAngeles | $ 0.98 | $ 0.24 | $ 1.8 | $ 0.53 | $ 0.94 |
Looking good: Houston and Los Angeles are both there. They are good cities near the coast, and may be good for career opportunities as well.
The following graph shows the average prices of organically farmed avocados in each city.
avocado2 %>%
filter(region_grouping == "city", type == "organic") %>%
select(date, region, average_price) %>%
ggplot(aes(x = reorder(region, -average_price, na.rm = T), y = average_price)) +
geom_jitter(aes(colour = region), alpha = 0.5) +
geom_violin(alpha = 0.5, size = 1) +
geom_hline(yintercept = 1.5, linetype = 2) +
geom_hline(yintercept = 1, linetype = 2) +
annotate("rect", xmin = "CincinnatiDayton", xmax = "Houston", ymin = -Inf, ymax = Inf, alpha = 0.2) +
geom_text(x = "Denver", y = 2.7, label = "My top 5 cities!", hjust = 0.5) +
stat_summary(fun = "mean") +
labs(x = "US City",
y = "Avocado prices",
caption = "From 2015-01-04 to 2018-03-25 (1176 days)",
title = "Figure 2. Violin plot of Organic Avocado Prices",
subtitle = "Visual aids: (1) Black dots are average price of individual avocado by city between Jan-2015 to Mar-2018, (2) the plot has been ordered descendingly, (3) Body of violin become fatter when data points increase.") +
theme_classic() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 25, vjust = 0.65),
plot.title = element_text(face = "bold", size = 15)) +
scale_y_continuous(lim = c(0, 3), breaks = seq(0, 3, 0.5))
My top 5 cities of preference for organic avocado:
avocado2 %>%
filter(region_grouping == "city", type == "organic") %>%
select(date, region, average_price) %>%
group_by(region) %>%
summarise(Average_price_per_avocado = paste0("$ ", round(mean(average_price), 2)),
Standard.deviation = paste0("$ ", round(sd(average_price), 2)),
max = paste0("$ ", round(max(average_price), 2)),
min = paste0("$ ", round(min(average_price), 2)),
median = paste0("$ ", round(median(average_price), 2))) %>%
arrange(Average_price_per_avocado) %>% # string sort; works here since all prices share the "$ x.xx" format
slice(1:5) %>%
mutate(No. = row_number()) %>%
relocate(No., .before = region) %>%
rename(city = region) %>%
kbl(align = "c",
table.attr = "style = 'width: 20%;'") %>%
kable_classic_2("hover")
| No. | city | Average_price_per_avocado | Standard.deviation | max | min | median |
|---|---|---|---|---|---|---|
| 1 | Houston | $ 1.27 | $ 0.25 | $ 1.92 | $ 0.81 | $ 1.26 |
| 2 | DallasFtWorth | $ 1.32 | $ 0.2 | $ 1.9 | $ 0.86 | $ 1.35 |
| 3 | Denver | $ 1.36 | $ 0.35 | $ 2.16 | $ 0.66 | $ 1.39 |
| 4 | CincinnatiDayton | $ 1.4 | $ 0.34 | $ 2.2 | $ 0.44 | $ 1.42 |
| 5 | Roanoke | $ 1.4 | $ 0.3 | $ 2.27 | $ 0.7 | $ 1.44 |
Looking good: Houston again, now ranked first, with DallasFtWorth second.
This section shows Houston might be the best city to move to. Even if I consumed organic avocados at the maximum price of $1.92 year-round, with a minimum of 1 avocado per day, the maximum spending would be:
1.92*365
## [1] 700.8
It is still less than my budget of $800 / year.
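A slightly fuller check (a sketch using the mean, median, and maximum organic prices from the Houston summary table above):
# Annual cost at 1 avocado per day, at several Houston organic price points
prices <- c(mean = 1.27, median = 1.26, max = 1.92)
prices * 365
##   mean median    max
## 463.55 459.90 700.80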
However, I will carry out forecasts later to check whether prices remain similar over the next 24 months.
It might be good to learn some avocado knowledge before going to the US, especially to find out which cities consume the most.
The following tree map shows total avocado consumption in each city:
# set up the df
df7.2 <- avocado2 %>%
select(type, region_grouping, region, total_consumption) %>%
filter(region_grouping == "city") %>%
group_by(type, region) %>%
summarise(consumption = sum(total_consumption)) %>%
group_by(type) %>%
mutate(per = round(consumption/sum(consumption), 3)) %>%
mutate(type = fct_recode(type,
"Non-organic Avocado" = "conventional",
"Organic Avocado" = "organic"))
# plot the graph
ggplot(df7.2, aes(area = consumption)) +
geom_treemap(aes(fill = type)) +
geom_treemap_text(aes(label = paste(region, "\n", prettyNum(consumption, big.mark = ","), "\n", per*100, "%")), place = "center",
colour = "white") +
labs(title = "Figure 3. Total Avocado Consumption in 45 US cities using Tree Map",
caption = "Total consumption From 2015-01-04 to 2018-03-25 (1176 days)") +
theme(legend.position = "none",
strip.text = element_text(size = 15),
strip.background = element_rect(fill = "white"),
plot.title = element_text(size = 20, face = "bold", hjust= 0.5, vjust = 2)) +
facet_wrap(~type)
Houston accounts for 4.5% of avocado consumption. Although that is about 3 times smaller than GreatLakes, nearly 200 million avocados consumed over roughly 3 years is tremendous. I am strongly confident I would find fellow avocado lovers in the city.
df7.3.1 <- avocado2 %>%
filter(region_grouping == "city") %>%
group_by(type) %>%
summarise(overall_consumption = sum(total_consumption)) %>%
mutate(per = round(overall_consumption/sum(overall_consumption)*100, 2),
per = paste0(per, "%")) %>%
mutate(type = fct_recode(type,
"Organic" = "organic",
"Non-Organic" = "conventional"))
ggplot(df7.3.1, aes(x = "", y = overall_consumption, fill = type)) +
geom_bar(stat = "identity") +
geom_text(aes(label = paste0(type, "\n", prettyNum(overall_consumption, big.mark = ","), "\n", per)),
position = position_stack(vjust = 0.5)) +
coord_polar(theta = "y",
start = 0) +
theme_minimal() +
theme(legend.position = "none",
axis.title = element_blank(),
axis.text = element_blank(),
plot.title = element_text(size = 11, hjust = 0.5, face = "bold")) +
labs(title = "Figure 4. 97% of Avocado were Cosumed as Inorganic and 3% were Consumed as Organic",
caption = "From 2015-01-04 to 2018-03-25 (1176 days)") +
scale_fill_brewer(palette = "Set3")
Drawing a quick bar chart to see how consumption differs across the top 10 cities for the two types of farming system.
df7.3.2 <- avocado2 %>%
select(type, region_grouping, region, total_consumption) %>%
filter(region_grouping == "city") %>%
group_by(type, region) %>%
summarise(consumption = sum(total_consumption)) %>%
group_by(type) %>%
mutate(per = round(consumption/sum(consumption), 3)) %>%
mutate(type = fct_recode(type,
"Non-organic Avocado" = "conventional",
"Organic Avocado" = "organic"))
df7.3.2 <- df7.3.2 %>%
group_by(type) %>%
arrange(desc(consumption)) %>%
slice(1:10)
ggplot(df7.3.2, aes(x = reorder(region, -consumption), y = consumption, fill = type)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~type, scale = "free_x") +
theme_bw() +
theme(legend.position = "none" ,
strip.text = element_text(size = 16),
plot.title = element_text(face = "bold", size = 20),
axis.text.x = element_text(angle = 25, vjust = 0.65, size = 12),
axis.title = element_text(size = 16),
axis.title.y = element_text(margin = margin(0, 15, 0, 0))) +
labs(x = "US City",
y = "Avocado consumption, per million",
title = "Figure 5. Top 10 US cities Avocado Consumption by Bar Chart",
subtitle = "Consumption of non-organic Avocado is way more than organic Avocado at even 25 times differet in GreatLakes",
caption = "Total consumption From 2015-01-04 to 2018-03-25 (1176 days)") +
scale_y_continuous(lim = c(0, 600000000),
breaks = seq(0, 600000000, 50000000),
labels = function(x){x/1000000}) +
geom_text(aes(label = paste0(round(consumption/1000000), " m")),
vjust = -0.6) +
geom_vline(xintercept = "Houston", linetype = 2, alpha = 0.5, colour = "grey5")
It will be interesting to see how my fellow avocado lovers buy their fruit in the US, and especially in Houston.
The data set contains variables categorising how avocados were sold: PLU4046 stands for small/medium Hass avocados, PLU4225 for large Hass avocados, PLU4770 for extra large Hass avocados, plus avocados sold in small, large, and xlarge bags.
Recall: although the names of the small, large, and xlarge bag columns suggest that their values are in units of “per bag”, the unit is actually per avocado, not per bag (see section 5).
# df preparation and data manipulation for pivot_longer
df7.4 <- avocado2 %>%
filter(region_grouping == "city") %>%
select(type, region, PLU4046, PLU4225, PLU4770, amount_from_small_bags, amount_from_large_bags, amount_from_xlarge_bags) %>%
pivot_longer(c("PLU4046", "PLU4225", "PLU4770", "amount_from_small_bags", "amount_from_large_bags", "amount_from_xlarge_bags"),
names_to = "category",
values_to = "consumption_per_day")
df7.4.2 <- df7.4 %>%
group_by(type, category) %>%
summarise(Total = sum(consumption_per_day)) %>%
mutate(category = factor(category,
levels = c("PLU4046", "PLU4225", "PLU4770", "amount_from_small_bags", "amount_from_large_bags", "amount_from_xlarge_bags")),
category = fct_recode(category,
"large_bags" = "amount_from_large_bags",
"small_bags" = "amount_from_small_bags",
"Xlarge_bags" = "amount_from_xlarge_bags"),
type = fct_recode(type,
"Non-organic Avocado" = "conventional",
"Organic Avocado" = "organic")) %>%
group_by(type) %>%
mutate(per = paste0(round(Total/sum(Total)*100, 2), "%"))
ggplot(df7.4.2, aes(x = "", y = Total, fill = category)) +
geom_bar(stat = "identity") +
coord_polar(theta = "y", start = 0) +
geom_text(aes(label = paste0(category, "\n",
prettyNum(Total, big.mark = ","), "\n",
per)),
position = position_stack(vjust = 0.5),
colour = "black") +
theme_minimal() +
theme(legend.position = "none",
axis.title = element_blank(),
strip.text = element_text(size = 15),
plot.title = element_text(size = 22, face = "bold", vjust = 1,
hjust = 0.5),
plot.subtitle = element_text(size = 14, hjust = 0.5),
axis.text = element_blank()) +
facet_wrap(~type, switch = "x") +
labs(title = "Figure 6. Pie chart comparing the ways all avocados sold in the US",
subtitle = "Total consumption From 2015-01-04 to 2018-03-25 (1176 days)")
Americans love buying PLU4046 (32.2%) and PLU4225 (36%) non-organic avocados, which stand for “small or medium” and “large” size avocados respectively. Not many like the extra large size (PLU4770), at only 3.47%. For packaging, people in the US like avocados packed in small bags (21.69%), but do not really prefer large and extra large packages.
See the relevant statistics:
df7.4.2 %>%
filter(type == "Non-organic Avocado") %>%
arrange(desc(Total)) %>%
kbl(align = "c",
caption = "Overall consumption of Non-organic Avocado in the US") %>%
kable_styling(bootstrap_options = c("hover", "bordered"))
| type | category | Total | per |
|---|---|---|---|
| Non-organic Avocado | PLU4225 | 1594976994 | 36.03% |
| Non-organic Avocado | PLU4046 | 1425379408 | 32.2% |
| Non-organic Avocado | small_bags | 959950034 | 21.69% |
| Non-organic Avocado | large_bags | 272024659 | 6.15% |
| Non-organic Avocado | PLU4770 | 153394014 | 3.47% |
| Non-organic Avocado | Xlarge_bags | 20943726 | 0.47% |
The pie chart above shows how small organic avocado consumption is compared to non-organic. The slices in the organic pie might be too small to read, so the following pie chart zooms in on organic avocados only.
ggplot(df7.4.2 %>% filter(type == "Organic Avocado"), aes(x = "", y = Total, fill = category)) +
geom_bar(stat = "identity") +
coord_polar(theta = "y", start = 0) +
geom_text(aes(label = paste0(category, "\n",
prettyNum(Total, big.mark = ","), "\n",
per)),
position = position_stack(vjust = 0.5),
colour = "black") +
theme_minimal() +
theme(legend.position = "none",
axis.text = element_blank(),
axis.title = element_blank(),
plot.title = element_text(face = "bold", size = 15),
strip.text = element_text(size = 14),
plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm"),
plot.subtitle = element_text(hjust = 0.5)) +
facet_wrap(~ type, switch = "x") +
labs(title = "Figure 7. Pie chart comparing the ways Organic avocados sold in the US",
subtitle = "Total consumption From 2015-01-04 to 2018-03-25 (1176 days)")
df7.4.2 %>%
filter(type == "Organic Avocado") %>%
group_by(type) %>%
arrange(desc(Total), .by_group = TRUE) %>%
kbl(align = "c",
caption = "Overall consumption of Organic Avocado in the US") %>%
kable_styling(bootstrap_options = c("hover", "bordered"))
| type | category | Total | per |
|---|---|---|---|
| Organic Avocado | PLU4225 | 49627505 | 36.55% |
| Organic Avocado | small_bags | 45156090 | 33.26% |
| Organic Avocado | large_bags | 22261124 | 16.4% |
| Organic Avocado | PLU4046 | 17973594 | 13.24% |
| Organic Avocado | PLU4770 | 754630 | 0.56% |
| Organic Avocado | Xlarge_bags | 3214 | 0% |
In the US organic avocado industry, PLU4225 (large size) is also the most popular avocado, and extra large avocados (PLU4770) and extra large bags are still the least preferred options, similar to the non-organic industry. Compared to non-organic consumers, organic consumers are fonder of avocados packed in small and large bags.
In Houston, 98% of avocados consumed between 2015-01-04 and 2018-03-25 were non-organic and only 2% were organic (compare figure 4 in section 7.3, which shows the US-wide split). Among the non-organic consumption, about 50% came from small/medium avocados (PLU4046), about 24% from large avocados (PLU4225), followed by 15.2% from avocados packed in small bags.
df7.5 <- df7.4 %>%
filter(region == "Houston") %>%
group_by(type, category) %>%
summarise(Total = sum(consumption_per_day)) %>%
mutate(category = factor(category,
levels = c("PLU4046", "PLU4225", "PLU4770", "amount_from_small_bags", "amount_from_large_bags", "amount_from_xlarge_bags")),
category = fct_recode(category,
"large_bags" = "amount_from_large_bags",
"small_bags" = "amount_from_small_bags",
"Xlarge_bags" = "amount_from_xlarge_bags"),
type = fct_recode(type,
"Non-organic Avocado" = "conventional",
"Organic Avocado" = "organic")) %>%
group_by(type) %>%
mutate(per = paste0(round(Total/sum(Total)*100, 2), "%"))
ggplot(df7.5, aes(x = "", y = Total, fill = category)) +
geom_bar(stat = "identity") +
coord_polar(theta = "y",
start = 0) +
geom_text(aes(label = paste0(category, "\n",
prettyNum(Total, big.mark = ","), "\n",
per)),
position = position_stack(vjust = 0.5),
colour = "black") +
facet_wrap(~ type, switch = "x") +
theme_minimal() +
scale_fill_brewer(palette = "OrRd") +
theme(legend.position = "none",
axis.title = element_blank(),
strip.text = element_text(size = 20),
plot.title = element_text(size = 22, face = "bold", vjust = 1,
hjust = 0.5),
plot.subtitle = element_text(size = 15, hjust = 0.5),
axis.text = element_blank()) +
labs(title = "Figure 8. Pie chart comparing the ways all avocados sold in the Houston",
subtitle = "Total consumption From 2015-01-04 to 2018-03-25 (1176 days)")
df7.5 %>%
filter(type == "Non-organic Avocado") %>%
arrange(desc(Total)) %>%
kbl(align = "c",
caption = "Non-organic Avocado consumption with selling methods in Houston") %>%
kable_styling(bootstrap_options = c("hover", "bordered"))
| type | category | Total | per |
|---|---|---|---|
| Non-organic Avocado | PLU4046 | 98198811 | 49.3% |
| Non-organic Avocado | PLU4225 | 47696206 | 23.94% |
| Non-organic Avocado | small_bags | 30272429 | 15.2% |
| Non-organic Avocado | large_bags | 17253877 | 8.66% |
| Non-organic Avocado | PLU4770 | 5455463 | 2.74% |
| Non-organic Avocado | Xlarge_bags | 326131 | 0.16% |
Among the 2% organic consumption, consumers prefer their organic avocados packed in small bags (56.82%), followed by small/medium avocados (PLU4046), which was the first preference of non-organic consumers. Again, extra large avocados and extra large bags are the least preferred options.
ggplot(df7.5 %>% filter(type == "Organic Avocado"), aes(x = "", y = Total, fill = category)) +
geom_bar(stat = "identity") +
coord_polar(theta = "y",
start = 0) +
geom_text(aes(label = paste0(category, "\n",
prettyNum(Total, big.mark = ","), "\n",
per)),
position = position_stack(vjust = 0.5),
colour = "black") +
theme_minimal() +
scale_fill_brewer(palette = "OrRd") +
theme(legend.position = "none",
axis.title = element_blank(),
strip.text = element_text(size = 20),
plot.title = element_text(size = 22, face = "bold", vjust = 1,
hjust = 0.5),
plot.subtitle = element_text(size = 15, hjust = 0.5),
axis.text = element_blank()) +
facet_wrap(~ type, switch = "x") +
labs(title = "Figure 9. Pie chart comparing the ways Organic avocados sold in the Houston",
subtitle = "Total consumption From 2015-01-04 to 2018-03-25 (1176 days)")
df7.5 %>%
filter(type == "Organic Avocado") %>%
arrange(desc(Total)) %>%
kbl(align = "c",
caption = "Organic Avocado consumption with selling methods in Houston") %>%
kable_styling(bootstrap_options = c("hover", "bordered"))
| type | category | Total | per |
|---|---|---|---|
| Organic Avocado | small_bags | 2252737 | 56.82% |
| Organic Avocado | PLU4046 | 1574086 | 39.7% |
| Organic Avocado | large_bags | 109883 | 2.77% |
| Organic Avocado | PLU4225 | 28229 | 0.71% |
| Organic Avocado | PLU4770 | 9 | 0% |
| Organic Avocado | Xlarge_bags | 0 | 0% |
In this section I will first predict avocado prices in the US, then in Houston. If the predicted prices in Houston exceed my budget, I will continue to search for and predict prices in other cities.
The 4 forecasts, each 24 months (104 weeks) ahead, are: (1) non-organic avocado prices in the US, (2) organic avocado prices in the US, (3) non-organic avocado prices in Houston, and (4) organic avocado prices in Houston.
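Since all four forecasts repeat the same preparation (average the weekly prices, then build a ts object with frequency 52), a small helper function could reduce the duplication; below is a sketch only, and the following sections keep the explicit step-by-step code:
# Sketch: build a weekly average-price time series from a pre-filtered data frame
make_price_ts <- function(data) {
data %>%
group_by(date) %>%
summarise(average_price = mean(average_price), .groups = "drop") %>%
pull(average_price) %>%
ts(start = c(2015, 1), frequency = 52)
}
# Example (equivalent to the conv_us_ts object created below):
# avocado2 %>%
#   filter(region_grouping == "city", type == "conventional") %>%
#   make_price_ts()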
This time plot shows that non-organic avocado prices were relatively constant between 2015 and 2016, followed by big fluctuations in 2016 and 2017. The series does not look stationary and shows a strong upward trend in price per avocado over time, though it dropped drastically at the end of 2017 and stabilised at the start of 2018. Seasonality appears present around August, September, and October of 2016 and 2017.
conv_us <- avocado2 %>%
filter(region_grouping == "city",
type == "conventional") %>%
group_by(date) %>%
summarise(average_price = mean(average_price))
# set up ts
conv_us_ts <- ts(conv_us$average_price,
start = c(2015, 1),
frequency = 52)
# plot
autoplot(conv_us_ts) +
labs(title = "Time plot: non-organic avocado weekly prices in the US",
subtitle = "Averaged from all cities",
y = "$") +
geom_point(colour = "brown", shape = 21) +
geom_path(colour = "brown")
The differencing plot below shows steep price changes near the beginning and end of 2017, suggesting there may be seasonality.
conv_us_ts_d <- diff(conv_us_ts)
autoplot(conv_us_ts_d) + geom_point(colour = "brown", shape = 21) +
labs(title = "Differencing plot: Change in weekly prices of non-organic avocado in the US",
y = "$")
Apart from visualising stationarity and seasonality, I can apply some statistical tests whose p-values aid my observations:
PP.test(conv_us_ts)
##
## Phillips-Perron Unit Root Test
##
## data: conv_us_ts
## Dickey-Fuller = -3.006, Truncation lag parameter = 4, p-value = 0.1563
# Null hypothesis: the series has a unit root (not stationary)
# p-value 0.1563 > 0.05: fail to reject, so the series is not stationary
summary(wo(conv_us_ts))
## Test used: WO
##
## Test statistic: 1
## P-value: 3.39749e-07 1.786835e-05 0.01781721
##
## The WO - test identifies seasonality
Looking at exponential smoothing models (ETS), I know this method cannot be used for a frequency higher than 24. My frequency is 52 because I am forecasting weekly data and there are 52 weeks in a year.
conv_us_ts_ETS <- ets(conv_us_ts)
## Warning in ets(conv_us_ts): I can't handle data with frequency greater than 24.
## Seasonality will be ignored. Try stlf() if you need seasonal forecasts.
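As the warning suggests, stlf() can handle weekly data by applying an STL decomposition and forecasting the seasonally adjusted series; a minimal sketch, not used for the final results:
# Sketch: STL-based forecast, the alternative suggested by the warning above
conv_us_ts_stlf <- stlf(conv_us_ts, h = 104)
autoplot(conv_us_ts_stlf)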
I now try ARIMA models instead.
conv_us_ts_arima <- auto.arima(conv_us_ts,
d = 1,
stepwise = F,
approximation = F)
The Ljung-Box test reveals a p-value of 0.7584, so I can safely say there is no significant autocorrelation left in this model’s residuals, suggesting it is safe to proceed to forecasting.
Box.test(conv_us_ts_arima$residuals, lag = 52, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: conv_us_ts_arima$residuals
## X-squared = 44.561, df = 52, p-value = 0.7584
I can also visualise the autocorrelation by looking at the ACF-lag plot. There is a little autocorrelation left at lag 10, but it should be fine as it is minor, and this is supported by the result of the Ljung-Box test. Additionally, it is good to see that the residuals are normally distributed.
checkresiduals(conv_us_ts_arima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,0)(1,0,0)[52]
## Q* = 24.408, df = 33, p-value = 0.8605
##
## Model df: 1. Total lags used: 34
conv_us_ts_arima_fc <- forecast(conv_us_ts_arima, h = 104)
autoplot(conv_us_ts_arima_fc) +
geom_hline(yintercept = 2, linetype = 2, colour = "blue") +
labs(subtitle = "Prediction of weekly non-organic avocado prices in the US",
y = "$")
summary(conv_us_ts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.8296 1.0511 1.1209 1.1636 1.2738 1.7233
Applying the same concepts as in the previous section.
org_us <- avocado2 %>%
filter(region_grouping == "city",
type == "organic") %>%
group_by(date) %>%
summarise(average_price = mean(average_price))
# set up ts
org_us_ts <- ts(org_us$average_price,
start = c(2015, 1),
frequency = 52)
# plot
autoplot(org_us_ts) +
labs(title = "Time plot: organic avocado weekly prices in the US",
subtitle = "Averaged from all cities",
y = "$") +
geom_point(colour = "darkgreen", shape = 21) +
geom_path(colour = "darkgreen")
PP.test(org_us_ts)
##
## Phillips-Perron Unit Root Test
##
## data: org_us_ts
## Dickey-Fuller = -2.5455, Truncation lag parameter = 4, p-value = 0.3486
# Not stationary
summary(wo(org_us_ts))
## Test used: WO
##
## Test statistic: 0
## P-value: 1 1 0.7244975
##
## The WO - test does not identify seasonality
I will not follow this p-value blindly: as my subseries plot below shows, there is a possibility of seasonality, which might be masked by noise.
ggsubseriesplot(org_us_ts) +
geom_point(colour = "darkgreen", fill = "green", shape = 21, size = 3) +
labs(title = "Subseries plot: The changes of Yearly prices in each month",
y = "$",
subtitle = "Between week 34 to 43, most years seems to have slightly increase in prices and is suggesting a possible seasonality.") +
annotate("rect", xmin = 34, xmax = 43, ymin = -Inf, ymax = Inf, fill = "yellow", alpha = 0.1)
Hopefully auto.arima will pick up the trend and seasonality and give a suitable model without autocorrelation.
org_us_ts_arima <- auto.arima(org_us_ts,
d = 1,
stepwise = F,
approximation = F)
Autocorrelation exists in the ACF plot of the chosen ARIMA model, suggesting some structure in the data would remain unexplained if I proceeded to forecast. The Ljung-Box test supports this result with a p-value of less than 0.05.
checkresiduals(org_us_ts_arima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,3)(0,1,0)[52]
## Q* = 100.59, df = 30, p-value = 1.497e-09
##
## Model df: 4. Total lags used: 34
I proceed with the forecast, but declare the autocorrelation found during the residual check.
org_us_ts_arima_fc <- forecast(org_us_ts_arima, h = 104)
autoplot(org_us_ts_arima_fc) +
labs(subtitle = "Prediction of weekly prices of organic avocado in the US",
caption = "This is the best model of ARIMA but accuracy affected by autocorrelation.",
y = "$") +
geom_hline(yintercept = 2, linetype = 2, colour = "blue")
Applying the same concepts as in the previous sections: plot the time plot first to observe trend, stationarity, and seasonality. From observation there is a lot of noise in the graph, making seasonality vague, though there are a couple of obvious peaks between 2016 and 2017 and between 2017 and 2018. There seems to be a gradual positive trend, though the price drops back to a low near $0.55 toward the end of the series.
conv_houston <- avocado2 %>%
filter(region == "Houston",
type == "conventional") %>%
group_by(date) %>%
summarise(average_price = mean(average_price))
# set up ts
conv_houston_ts <- ts(conv_houston$average_price,
start = c(2015, 1),
frequency = 52)
# plot
autoplot(conv_houston_ts) +
labs(title = "Time plot: non-organic avocado weekly prices in Houston",
y = "$") +
geom_point(colour = "brown", shape = 21) +
geom_path(colour = "brown")
The stationarity test gives a p-value higher than 0.05, so the series is not stationary.
PP.test(conv_houston_ts)
##
## Phillips-Perron Unit Root Test
##
## data: conv_houston_ts
## Dickey-Fuller = -3.2503, Truncation lag parameter = 4, p-value =
## 0.08177
There is too much noise in the graph, and the seasonality test does not identify seasonality.
summary(wo(conv_houston_ts))
## Test used: WO
##
## Test statistic: 0
## P-value: 1 1 0.05514139
##
## The WO - test does not identify seasonality
Applying the auto.arima function in R to choose the best model.
conv_houston_ts_arima <- auto.arima(conv_houston_ts,
d = 1,
approximation = F,
stepwise = F,
trace = T)
Testing the model for autocorrelation. The ACF plot shows that the model explains most of the data, leaving only a few minor autocorrelations. This suggests I am safe to proceed with forecasting.
checkresiduals(conv_houston_ts_arima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,0)(0,0,1)[52]
## Q* = 31.227, df = 33, p-value = 0.5556
##
## Model df: 1. Total lags used: 34
Creating the forecast of non-organic avocado prices in Houston for the next 2 years.
conv_houston_ts_arima_fc <- forecast(conv_houston_ts_arima, h = 104)
autoplot(conv_houston_ts_arima_fc) + labs(subtitle = "Prediction of weekly prices of non-organic avocado in Houston",
y = "$") +
geom_hline(yintercept = 2, linetype = 2, colour = "blue")
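To connect the forecast to my $2 threshold, this sketch computes the share of the 104 forecast weeks whose 95% upper prediction bound stays below $2:
# Sketch: share of forecast weeks with the 95% upper bound under the $2 line
upper95 <- conv_houston_ts_arima_fc$upper[, "95%"]
mean(upper95 < 2)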
The accuracy of this forecast is about 92.97% (100 - MAPE).
summary(conv_houston_ts_arima)
## Series: conv_houston_ts
## ARIMA(0,1,0)(0,0,1)[52]
##
## Coefficients:
## sma1
## 0.1864
## s.e. 0.0965
##
## sigma^2 estimated as 0.006018: log likelihood=190.7
## AIC=-377.39 AICc=-377.32 BIC=-371.15
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.001014761 0.07711343 0.05503131 -0.6660946 7.025728 0.4368157
## ACF1
## Training set -0.04668654
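The same figure can be pulled programmatically from the training-set error measures; a one-line sketch:
# Sketch: accuracy expressed as 100 - MAPE
100 - accuracy(conv_houston_ts_arima)[, "MAPE"]  # ~ 92.97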
Hopefully organic avocados have a good range of prices over the next two years.
The time plot below shows the overall weekly changes in organic avocado prices in Houston from 2015-01-04 to 2018-03-25.
# set up df for this section
org_houston <- avocado2 %>%
filter(type == "organic",
region == "Houston") %>%
arrange(date)
# set up ts df
org_houston_ts <- ts(org_houston$average_price,
frequency = 52, # My data is weekly.
start = c(2015, 1))
#Plot
autoplot(org_houston_ts) +
labs(title = "Time plot: organic avocado weekly prices in Houston",
caption = "Total consumption From 2015-01-04 to 2018-03-25 (1176 days)",
y = "Price per Avocado ($)") +
geom_point(colour = "darkgreen", shape = 21) +
geom_path(colour = "darkgreen")
In the first year of the data, the organic avocado price was relatively constant for the first 6 months of 2015; however, it fluctuated drastically over the next 2 to 3 years.
The stationarity test fails to reject the null hypothesis (p-value = 0.2311), so the series is not stationary.
PP.test(org_houston_ts)
##
## Phillips-Perron Unit Root Test
##
## data: org_houston_ts
## Dickey-Fuller = -2.8269, Truncation lag parameter = 4, p-value = 0.2311
The seasonality test shows no indication of seasonality from its p-value.
summary(wo(org_houston_ts))
## Test used: WO
##
## Test statistic: 0
## P-value: 1 1 0.7900466
##
## The WO - test does not identify seasonality
Exponential smoothing models (ETS)
I apply ETS modelling using the following code. However, this model cannot handle a frequency greater than 24, and I have a frequency of 52 because my data is weekly. I will therefore choose another model for my forecasting.
org_houston_ts_ets <- ets(org_houston_ts)
## Warning in ets(org_houston_ts): I can't handle data with frequency greater than
## 24. Seasonality will be ignored. Try stlf() if you need seasonal forecasts.
ARIMA Model
Applying an ARIMA model for my forecast.
org_houston_ts_arima <- auto.arima(org_houston_ts,
d = 1,
stepwise = F,
approximation = F,
trace = T)
The residual check shows little autocorrelation left in the ACF plot; the model explains most of the data and uses it quite well. The residuals are also normally distributed.
checkresiduals(org_houston_ts_arima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,2)
## Q* = 31.381, df = 32, p-value = 0.4977
##
## Model df: 2. Total lags used: 34
The Ljung-Box test p-value is greater than 0.05, so I can safely say no significant autocorrelation is present, and I can safely use this model for forecasting.
Box.test(org_houston_ts_arima$residuals, lag = 52, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: org_houston_ts_arima$residuals
## X-squared = 42.259, df = 52, p-value = 0.8304
Synthesising the forecast graph.
org_houston_ts_arima_fc <- forecast(org_houston_ts_arima, h = 104)
autoplot(org_houston_ts_arima_fc) +
labs(subtitle = "Prediction of weekly prices of organic avocado in Houston",
y = "$") +
geom_vline(xintercept = 2019, linetype = 2, colour = "red") +
geom_hline(yintercept = 2, linetype = 2, colour = "blue")
This model has nearly 95% (94.9515%) accuracy (100 - MAPE).
summary(org_houston_ts_arima) # or accuracy(org_houston_ts_arima)
## Series: org_houston_ts
## ARIMA(0,1,2)
##
## Coefficients:
## ma1 ma2
## 0.1212 -0.2221
## s.e. 0.0789 0.0830
##
## sigma^2 estimated as 0.01007: log likelihood=148.8
## AIC=-291.61 AICc=-291.46 BIC=-282.24
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.0007205485 0.09945763 0.0629353 -0.301139 5.048496 0.1932152
## ACF1
## Training set 0.01370076
The bad news is that after 2019, the price per organic avocado may exceed $2, at which point I might switch to non-organic avocados.
The best city for me to move to is Houston. I will buy organic avocados until the end of 2019, then switch to non-organic ones. These conditions fulfil my desire of at least 1 avocado per day without exceeding the threshold of $800 per year, while at the same time being more environmentally friendly.
This is a personal project created and designed for non-commercial use only. The data set was introduced by the Google Data Analytics certificate course and made available to students for analytics practice during the course; it is a public data set from Kaggle (please visit topic 4 in this project for the link). This project extends the analysis scope of the course and was not submitted to or graded by Google. Neither Google, I, nor any stakeholders relevant to the creation of this project hold any responsibility for outcomes arising from the use of its results. This project is for educational and skill-demonstration purposes only.
All photos in this project, such as the thumbnail, are for demonstration only; they have nothing to do with the data set, the locations where the data were collected, or the results of this analysis.
Hass Avocado Board n.d., How to Identify Hass Avocados, viewed 18 June 2021, https://loveonetoday.com/how-to/identify-hass-avocados/
Overview of Avocado Datasets n.d., viewed 18 June 2021, https://cran.r-project.org/web/packages/avocado/vignettes/a_intro.html
Russell, M 2020, New Zealand avocado returns high despite COVID-19 affected season, viewed 18 June 2021, https://www.freshplaza.com/article/9233181/new-zealand-avocado-returns-high-despite-covid-19-affected-season/