Note: this analysis was performed using the open-source software R and RStudio.
The objective of this applied project is to explain the price of avocados using some basic descriptive analysis. Producers, retailers, and grocery stores can use this analysis to inform decisions about pricing, advertising, and supply chain strategies, among others. Additional analysis will follow in a later installment. Your feedback is highly appreciated.
This data was downloaded from the Hass Avocado Board website in May of 2018 and compiled into a single CSV. Here’s how the Hass Avocado Board describes the data on their website:

"The table below represents weekly retail scan data for National retail volume (units) and price. Retail scan data comes directly from retailers’ cash registers based on actual retail sales of Hass avocados. Starting in 2013, the table below reflects an expanded, multi-outlet retail data set. Multi-outlet reporting includes an aggregation of the following channels: grocery, mass, club, drug, dollar and military. The Average Price (of avocados) in the table reflects a per unit (per avocado) cost, even when multiple units (avocados) are sold in bags. The Product Lookup codes (PLU’s) in the table are only for Hass avocados. Other varieties of avocados (e.g. greenskins) are not included in this table."
data <- read.csv("avocado (1).csv")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
library(ggplot2)
library(magrittr)
str(data)
## 'data.frame': 12628 obs. of 7 variables:
## $ date : chr "2017/12/3" "2017/12/3" "2017/12/3" "2017/12/3" ...
## $ average_price: num 1.39 1.44 1.07 1.62 1.43 1.58 1.14 1.77 1.4 1.88 ...
## $ total_volume : int 139970 3577 504933 10609 658939 38754 86646 1829 488588 21338 ...
## $ type : chr "conventional" "organic" "conventional" "organic" ...
## $ year : int 2017 2017 2017 2017 2017 2017 2017 2017 2017 2017 ...
## $ geography : chr "Albany" "Albany" "Atlanta" "Atlanta" ...
## $ Mileage : int 2832 2832 2199 2199 2679 2679 827 827 2998 2998 ...
# Strip stray whitespace from the column names (this also covers a trailing-space "average_price ")
names(data) <- trimws(names(data))
# Handle the alternative column name "AveragePrice" if it is present
if ("AveragePrice" %in% names(data)) {
  names(data)[names(data) == "AveragePrice"] <- "average_price"
}
# Make sure the price is numeric even if it was read in as text
data <- data %>%
  mutate(average_price = readr::parse_number(as.character(average_price)))
str(data$average_price)
## num [1:12628] 1.39 1.44 1.07 1.62 1.43 1.58 1.14 1.77 1.4 1.88 ...
hist(data$average_price,
main = "Histogram of average_price",
xlab = "Price in USD")
ggplot(data, aes(x = average_price, fill = type)) +
geom_histogram(bins = 30, color = "red") +
ggtitle("Frequency of Average Price - Organic vs. Conventional")
# Simple EDA with ggplot: total volume by geography, stacked by year
ggplot(data, aes(x = reorder(geography, total_volume),
                 y = total_volume, fill = factor(year))) +
  geom_col() +
  labs(x = "geography", y = "total_volume", fill = "year") +
  theme(axis.text.x = element_text(angle = 90, size = 7))
# Sample response for year 2017: the plot shows that Los Angeles has the highest sales volume in 2017.
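A reading taken from a stacked bar chart can be cross-checked numerically. The following is a minimal sketch, assuming the column names shown in the `str()` output above (`year`, `geography`, `total_volume`); the helper `top_geographies()` and the toy data frame are hypothetical and are not part of the original analysis.

```r
library(dplyr)

# Hypothetical helper (not part of the original analysis): rank geographies
# by summed total_volume within a single year.
top_geographies <- function(df, yr, n = 5) {
  df %>%
    filter(year == yr) %>%
    group_by(geography) %>%
    # as.numeric() avoids integer overflow when summing large weekly volumes
    summarise(volume = sum(as.numeric(total_volume), na.rm = TRUE)) %>%
    arrange(desc(volume)) %>%
    head(n)
}

# Toy rows with made-up values, only to show the call; not from the avocado data
toy <- data.frame(
  year = c(2017, 2017, 2017, 2017, 2016),
  geography = c("A", "A", "B", "B", "B"),
  total_volume = c(10L, 20L, 5L, 5L, 100L)
)
top_2017 <- top_geographies(toy, 2017, n = 1)
print(top_2017)
```

On the real data frame, `top_geographies(data, 2017)` would list the leading geographies for 2017. If the file contains an aggregate row (for example a national total), it would need to be filtered out first for the ranking to be meaningful.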
set.seed(245566)
customer_data_samp <- dplyr::sample_frac(data, size = 0.1, replace = FALSE)
ggplot(customer_data_samp, aes(x = average_price, y = total_volume)) +
geom_point() +
ggtitle("Total sales as a function of average price")
ggplot(data, aes(x = average_price)) +
geom_bar()
ggplot(data, aes(x = as.factor(year), y = average_price)) +
geom_boxplot(fill = "red") +
labs(
title = "Distribution of average price by year",
x = "year",
y = "average_price"
)
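The quartiles that a boxplot draws can also be tabulated directly, which makes year-to-year comparisons easier to quote. This is a rough sketch, assuming the `year` and `average_price` columns from the `str()` output above; the helper `price_quartiles_by_year()` and the toy values are hypothetical.

```r
library(dplyr)

# Hypothetical helper (not part of the original analysis): per-year quartiles
# of average_price, matching the box edges and midline of a boxplot.
price_quartiles_by_year <- function(df) {
  df %>%
    group_by(year) %>%
    summarise(
      q1    = unname(quantile(average_price, 0.25, na.rm = TRUE)),
      med   = median(average_price, na.rm = TRUE),
      q3    = unname(quantile(average_price, 0.75, na.rm = TRUE)),
      count = n()
    ) %>%
    arrange(year)
}

# Toy rows with made-up prices, only to show the call; not from the avocado data
toy <- data.frame(
  year = c(2016, 2016, 2016, 2017, 2017, 2017),
  average_price = c(1.0, 1.2, 1.4, 1.5, 1.7, 1.9)
)
quartiles <- price_quartiles_by_year(toy)
print(quartiles)
```

Calling `price_quartiles_by_year(data)` on the real data frame would give one row per year.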