First, we’ll load in our libraries. I’m using
library(here) on the advice of Jenny Bryan from her Project-oriented
workflow post on Tidyverse. This is just to organize my project into
a folder in order to keep it self-contained and portable.
# loading libraries
library(readr)
library(dplyr)
library(tidyverse)
library(here)
Next, we’ll load in the data. This comes from 4 tabs of a Google Sheets file on which I’ve recorded my candle research. Variables include things such as the price and weight of a candle, as well as how long it burned.
# loading data
brands <- read_csv("brands.csv")
burn_times <- read_csv("burn_times.csv")
materials <- read_csv("materials.csv")
purchases <- read_csv("purchases.csv")
Now that we’ve got our data, let’s do something with it! Using the
%>% pipe character from the dplyr library, let’s find
the mean time that I let each candle burn for during a session.
mean_session_times <- burn_times %>%
group_by(candle_id) %>%
summarize(mean = mean(session_time))
mean_session_times
## # A tibble: 12 × 2
## candle_id mean
## <dbl> <dbl>
## 1 1 3
## 2 2 3.49
## 3 3 2.96
## 4 4 2.18
## 5 5 3.47
## 6 6 3.39
## 7 7 4.09
## 8 8 4.09
## 9 9 4.74
## 10 10 4.79
## 11 11 5.38
## 12 12 3.16
Interestingly, it seems that the amount of time I let a candle burn
for during each session trended upward over time, though it didn’t
always increase from one candle to the next. Let’s use a bar graph to
visualize the data. I’m using the ggplot2 library to create
the visual.
viz <- ggplot(mean_session_times,aes(fill=as.factor(candle_id),x=candle_id,y=mean)) + geom_bar(stat = "identity", ,col="brown")
viz + labs(title="Average Session Times",subtitle="Data from Candlegraph",caption="The graph shows that I became less careful about keeping burn times down as time went on.",x="Candle ID",y="Hours",fill="Candle ID")
While the above graph is a good start, it doesn’t tell us the candle
names, just their IDs. This is because burn_times.csv and
the burn_times data frame that was created from it do not
have those names listed. But our purchases data frame does.
Both frames contain the common column candle_id, which
allows us to associate the candle_id with the candle name
(scent_name) using an inner join.
scents <- burn_times %>%
inner_join(purchases)
## Joining with `by = join_by(candle_id)`
scents
## # A tibble: 168 × 16
## candle_id start_time stop_time session_time start_temp stop_temp start_rh
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 2022-02-20 12… 2022-02-… 3 70.5 74.1 42.8
## 2 1 2022-02-21 12… 2022-02-… 3 75.5 82 50.4
## 3 1 2022-02-22 20… 2022-02-… 3 73.1 76.8 45.1
## 4 1 2022-02-23 21… 2022-02-… 2.4 67.2 68 37.2
## 5 1 2022-02-24 20… 2022-02-… 3.3 67.8 68.5 36
## 6 1 2022-02-28 11… 2022-02-… 2.8 66.6 75.4 36.1
## 7 1 2022-03-03 21… 2022-03-… 3.1 76.3 81.9 41
## 8 1 2022-03-07 20… 2022-03-… 3.2 73.3 77.1 41.3
## 9 1 2022-03-11 20… 2022-03-… 3.2 68.2 74.5 39.7
## 10 1 2022-03-15 20… 2022-03-… 3.2 75.3 82.7 41.4
## # ℹ 158 more rows
## # ℹ 9 more variables: stop_rh <dbl>, start_dp <dbl>, stop_dp <dbl>,
## # session_id <dbl>, brand_id <dbl>, brand_name <chr>, scent_name <chr>,
## # price_usd <dbl>, weight_oz <dbl>
Now we have the ID of the candle and its associated name (listed as
scent_name) in our data frame. For our purposes, we only
need the candle_id, session_time, and
scent_name columns, so let’s update scents to
include only those three column names, using the select
function.
scents <- scents %>%
select(candle_id,session_time,scent_name)
scents
## # A tibble: 168 × 3
## candle_id session_time scent_name
## <dbl> <dbl> <chr>
## 1 1 3 Slow Burn
## 2 1 3 Slow Burn
## 3 1 3 Slow Burn
## 4 1 2.4 Slow Burn
## 5 1 3.3 Slow Burn
## 6 1 2.8 Slow Burn
## 7 1 3.1 Slow Burn
## 8 1 3.2 Slow Burn
## 9 1 3.2 Slow Burn
## 10 1 3.2 Slow Burn
## # ℹ 158 more rows
Now that we have the session_time linked to the
scent_name, we can find the average (mean) time per session
for each scent.
mean_session_times <- scents %>%
group_by(candle_id) %>%
mutate(mean = mean(session_time)) %>%
slice(1)
mean_session_times
## # A tibble: 12 × 4
## # Groups: candle_id [12]
## candle_id session_time scent_name mean
## <dbl> <dbl> <chr> <dbl>
## 1 1 3 Slow Burn 3
## 2 2 4.4 Mentheverte 3.49
## 3 3 2.3 Cozy Cabin 2.96
## 4 4 2 Edition 02 - Shiso 2.18
## 5 5 4.8 Small Fires 3.47
## 6 6 3 Tobacco Toscano 3.39
## 7 7 4.1 34 Boulevard Saint-Germain 4.09
## 8 8 4.3 30 Montaigne 4.09
## 9 9 4 Goji Tarocco Orange 4.74
## 10 10 4 No. 12 Hacienda 4.79
## 11 11 4 Sandalwood Rose 5.38
## 12 12 3.6 Ash 3.16
That seems a little more readable. Let’s use this new variable to create a bar chart similar to the one above. Note that I’m removing the text for the scent names on the x-axis, as it got a little crowded. Instead, we can use the legend on the right to tell us which bar represents which candle.
viz <- ggplot(mean_session_times,aes(fill=scent_name,x=scent_name,y=mean)) + geom_bar(stat = "identity", ,col="brown") + theme(axis.text.x = element_blank())
viz + labs(title="Average Session Times",subtitle="Data from Candlegraph",x="Scent Name",y="Hours",fill="Scent Name")
That’s much better! However, because we are labeling by the
scent_name rather than the candle_id, the list
is now in alphabetical instead of numerical order. To preserve the
numerical order, we can use the dplyr library. (I
discovered this trick from Reorder
a variable with ggplot2.)
viz <- mean_session_times %>%
arrange(mean) %>%
mutate(scent_name=factor(scent_name,levels=scent_name)) %>%
ggplot(aes(fill=scent_name,x=scent_name,y=mean)) + geom_bar(stat = "identity", ,col="brown") + theme(axis.text.x = element_blank())
viz + labs(title="Average Session Times",subtitle="Data from Candlegraph",x="Scent Name",y="Hours",fill="Scent Name")
Now the mean burn time for each session is listed in order of
candle_id, and you can easily see the progession.
Thanks for reading!