Project1

Author

Arif

Introduction:

For this project, I wanted to deviate from any serious topics and focus on something light. The dataset contains information about coffee, rice, and beef prices on the market within the last 30 years. I figured this dataset is somewhat relatable to what we are dealing today as we see prices skyrocketed.

The dataset contains 360 obs. of 5 variables: year from 1992-2022, month from January-December, prices of rice, coffee, and beef in kilo, inlfation rate, and inflated prices of rice, coffee, and beef. For this dataset, I wanted to see side by side the inflated prices for this products. I wanted to know which products overall has the highest inflation price within the last 30 years.

Here is the website where I obtain this dataset: https://www.kaggle.com/datasets/timmofeyy/-coffee-rice-and-beef-price-changes-for-30-years

To start

Set up a working directory to obtain the dataset by following the command below. read.csv command allow to load the data from a CSV file into the dataframe.

setwd("/Users/arifjadji/Desktop/DATA110/DATA110 Datasets")
rice_beef_coffee_price_changes<-read.csv("rice_beef_coffee_price_changes.csv")
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Clean up the data:

Make all headers lower case and remove spaces. Since there are 4 variables, we can use “summary” to decide which variables we want to focus on.

names(rice_beef_coffee_price_changes) <- tolower(names(rice_beef_coffee_price_changes))
names(rice_beef_coffee_price_changes) <- gsub(" ","",names(rice_beef_coffee_price_changes))
str(rice_beef_coffee_price_changes)

'data.frame':   360 obs. of  9 variables:
 $ year             : int  1992 1992 1992 1992 1992 1992 1992 1992 1992 1992 ...
 $ month            : chr  "Feb" "Mar" "Apr" "May" ...
 $ price_beef_kilo  : num  2.52 2.45 2.38 2.38 2.33 2.35 2.45 2.44 2.5 2.54 ...
 $ price_rice_kilo  : num  0.28 0.28 0.28 0.27 0.27 0.28 0.27 0.26 0.25 0.25 ...
 $ price_coffee_kilo: num  1.51 1.55 1.45 1.34 1.3 1.28 1.17 1.17 1.36 1.48 ...
 $ inflation_rate   : num  89.6 89.6 89.6 89.6 89.6 ...
 $ price_rice_infl  : num  0.53 0.53 0.53 0.51 0.51 0.53 0.51 0.49 0.47 0.47 ...
 $ price_beef_infl  : num  4.78 4.64 4.51 4.51 4.42 4.46 4.64 4.63 4.74 4.82 ...
 $ price_coffee_infl: num  2.86 2.94 2.75 2.54 2.46 2.43 2.22 2.22 2.58 2.81 ...

#summary(rice_beef_coffee_price_changes)

Selection of Variables:

By using the command below we can select the variables we want to use. The select function is used to pick specific variables or features of a data frame or tibble. Here we are selecting ‘price_beef_infl, ’price_rice_infl’, and ‘price_coffee_infl’ and grouping them by month, year.

I wanted to focus on the inflation prices of beef, rice, and coffee and how it change throughout the past 30 years.

rice_beef_coffee_price_changes <- rice_beef_coffee_price_changes %>% 
  select(month, year, 'price_beef_infl', 'price_rice_infl', 'price_coffee_infl') %>%
  group_by(month, year)
head(rice_beef_coffee_price_changes)

# A tibble: 6 × 5
# Groups:   month, year [6]
  month  year price_beef_infl price_rice_infl price_coffee_infl
  <chr> <int>           <dbl>           <dbl>             <dbl>
1 Feb    1992            4.78            0.53              2.86
2 Mar    1992            4.64            0.53              2.94
3 Apr    1992            4.51            0.53              2.75
4 May    1992            4.51            0.51              2.54
5 Jun    1992            4.42            0.51              2.46
6 Jul    1992            4.46            0.53              2.43

Facet_Wrap

Use the package “tidyr” to convert the dataset from wide to long with the command “gather”. This will take each column (price_beef_infl, price_rice_infl, and price_coffee_infl) combine them all into one column called “id”. Then each cell count will go into the new column, “prices”. Finally, we are only doing this for the quantitiative variables, which are in columns 3 - 5. Note the command facet_wrap requires (~) before “id”. (Saidi, 2020)

Here we are looking at the price changes throughout the years. We noticed how the price of coffee and beef are significantly higher than the price of rice. The price of coffee seem to have the highest change; it never went back down below 2.5 after 2004-2005. Nonetheless, all three shows an increase in price.

rice_beef_coffee_price_changeslong <- rice_beef_coffee_price_changes %>% 
  tidyr::gather("id", "prices", 3:5) 

rice_beef_coffee_price_changesplot <-rice_beef_coffee_price_changeslong %>% 
  ggplot(., aes(year, prices))+
  geom_point()+
  aes(color = id)+
  facet_wrap(~id)
rice_beef_coffee_price_changesplot

Warning: Removed 39 rows containing missing values (`geom_point()`).

Focusing more into the prices.

From the facet_wrap plot above, we noticed the inflation rates of these three products. It seems that between 2007-2010 the prices of these products increases. The command below will allow us to filter out the inflation prices of these three products.

prices <- rice_beef_coffee_price_changeslong %>%
  filter( id== "price_coffee_infl" |id== "price_beef_infl" | id== "price_rice_infl")%>%
  group_by(year, month) %>%
  arrange(desc(prices))
prices

# A tibble: 1,080 × 4
# Groups:   year, month [360]
   month  year id                prices
   <chr> <int> <chr>              <dbl>
 1 May    1997 price_coffee_infl   9.76
 2 Sep    1994 price_coffee_infl   8.78
 3 Jul    1994 price_coffee_infl   8.67
 4 Jun    1997 price_coffee_infl   8.1 
 5 Oct    1994 price_coffee_infl   7.99
 6 Aug    1994 price_coffee_infl   7.93
 7 Apr    2011 price_coffee_infl   7.83
 8 Mar    2011 price_coffee_infl   7.62
 9 May    2011 price_coffee_infl   7.59
10 Apr    1997 price_coffee_infl   7.56
# ℹ 1,070 more rows

Plotting the price inflation of these three products:

To finalize the barplot use the syntax below: position = “dodge” makes side-by-side bars, stat = “identity” allows you to plot each set of bars for each year between 1992-2022, the ggtitle gives the plot a title, and lastly the labs gives a title to the legend (Saidi, 2020)

In this barplot, we can see more clearly how much these prices changes within the last 30 years. It looks like in the year 2006 the data colletion was distrupted, this may be the reason why we noticed a slight increase of prices for all of this products in the year 2007-2010.

plot <- prices %>%
  ggplot() +
  geom_bar(aes(x=year, y=prices, fill = id),
   position = "dodge", stat= "identity") +
  ggtitle("Price changes of beef, coffee, and rice from 1992-2022") +
  ylab("Inflation Rate") + 
  labs(fill = "Products")
plot

Warning: Removed 39 rows containing missing values (`geom_bar()`).

Loess Smoother

Another way to look at the correlation between the price inflation to the year is by creating a scatterplot. The command below will allow us to create a Loess Smoother. The ‘geom_point’ adds a layer of points to the plot, ‘alpha=’ controls the transparency of the dots, ‘geom_smooth’ adds a trendline over an existing plot.

This is showing that the prices of coffee, rice, and beef increases in relation to the year.

ggplot(prices, aes(year, prices)) +
  geom_point(aes(size = prices), alpha = 1/2) +
  geom_smooth() +
  scale_size_area()

`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Warning: Removed 39 rows containing non-finite values (`stat_smooth()`).

Warning: Removed 39 rows containing missing values (`geom_point()`).

What I could not get to work

I wanted to create a stream graphs or alluvials; however, I struggled to make it work. I tried to install Rtools40 but it does not seem to work in macbooks. Thank you.