Setup Packages
library(tidyverse) # For data wrangling purposes and ggplot
library(tidytuesdayR) # For TT data
library(papaja) # For APA theme
Load the Data
Download the weekly data and make it available in the tt object.
# Extract data using instructions from TT readme file
tt <- tt_load("2021-09-14")
# Assign each data set to a new data frame
billboard <- tt$billboard
audio_features <- tt$audio_features
Wrangle
Select a subset of interesting variables. In this case, I want to know if song danceability is associated with song popularity. I can select the appropriate variables from the data set using the select() function and remove all rows containing NAs in either danceability or popularity by using drop_na().
# From audio_features data set, select interesting variables and drop the rows which contain NAs for danceability and popularity.
audio <- audio_features %>%
select(song, performer, spotify_genre, danceability, spotify_track_popularity) %>%
drop_na(danceability, spotify_track_popularity)
glimpse(audio)
## Rows: 24,330
## Columns: 5
## $ song <chr> "......And Roses And Roses", "...And Then The…
## $ performer <chr> "Andy Williams", "Sandy Nelson", "Britney Spe…
## $ spotify_genre <chr> "['adult standards', 'brill building pop', 'e…
## $ danceability <dbl> 0.154, 0.588, 0.759, 0.613, 0.647, 0.450, 0.8…
## $ spotify_track_popularity <dbl> 38, 11, 77, 73, 40, 31, 29, 42, 42, 10, 70, 2…
Visualise
I want to visualise the relationship between two numeric variables, so a scatterplot would be the natural choice. However, with over 24,000 individual data points, a plain scatterplot would be messy and uninformative. Instead of geom_point(), I use stat_binhex(), which bins the data points into hexagons and colours each hexagon along a gradient according to how many data points it contains. I set the gradient colours with scale_fill_gradient(), choosing the ‘low’ and ‘high’ colours from the ggplot colour cheat-sheet. I also apply the theme_apa() layer from the papaja package to make the plot APA style (-ish).
dance_plot <- ggplot(audio, mapping = aes(x = danceability, y = spotify_track_popularity)) +
stat_binhex(alpha = 0.8) +
scale_fill_gradient(low = "navyblue", high = "red2") +
theme_apa() +
ylab("Track Popularity") +
xlab("Track Danceability")
print(dance_plot)

This plot is okay but I think I can do better.
Re-Visualise
I will try a series of box plots showing the popularity of songs that fall into different bins of danceability. First, I create a character vector containing the labels of my bins. Whilst danceability in this data set is coded from 0 to 1, I label the bins on a 0-10 scale because lots of decimal values would just crowd the x-axis of the plot.
tags <- c("(0-1)","(1-2)", "(2-3)", "(3-4)", "(4-5)",
"(5-6)","(6-7)", "(7-8)","(8-9)", "(9-10)")
After creating my tags, it is time to create a new data frame containing all of my relevant variables. First, I select danceability and popularity and put them together in a new data frame called dance. Then, I use the mutate() function to create a new variable called ‘tag’. The case_when() function effectively says “when danceability falls between x and y, assign it the nth tag from my object tags”, which I created above. For example, in the first case, when danceability is less than 0.1, the row is assigned the first tag in the vector (“(0-1)”). If danceability is at least 0.1 but less than 0.2, it is assigned the second tag (“(1-2)”). This sorts my danceability scores into 10 bins.
dance <- audio %>%
select(danceability, spotify_track_popularity)
dance <- as_tibble(dance) %>%
mutate(tag = case_when(
danceability < 0.1 ~ tags[1],
danceability >= 0.1 & danceability < 0.2 ~ tags[2],
danceability >= 0.2 & danceability < 0.3 ~ tags[3],
danceability >= 0.3 & danceability < 0.4 ~ tags[4],
danceability >= 0.4 & danceability < 0.5 ~ tags[5],
danceability >= 0.5 & danceability < 0.6 ~ tags[6],
danceability >= 0.6 & danceability < 0.7 ~ tags[7],
danceability >= 0.7 & danceability < 0.8 ~ tags[8],
danceability >= 0.8 & danceability < 0.9 ~ tags[9],
danceability >= 0.9 & danceability <= 1.0 ~ tags[10])) # <= so a score of exactly 1 is binned rather than left as NA
Right now the tag variable is a character variable. I must transform it into a factor before using it to create multiple boxplots. This is done using the factor() function. Instead of writing out each level individually, I pass the tags I created earlier as the levels, which also fixes the order in which the bins appear on the x-axis. Setting ‘ordered = FALSE’ (the default) simply keeps it a regular, unordered factor. Notice each danceability score now has a corresponding tag.
dance$tag <- factor(dance$tag,
levels = tags,
ordered = FALSE)
glimpse(dance)
## Rows: 24,330
## Columns: 3
## $ danceability <dbl> 0.154, 0.588, 0.759, 0.613, 0.647, 0.450, 0.8…
## $ spotify_track_popularity <dbl> 38, 11, 77, 73, 40, 31, 29, 42, 42, 10, 70, 2…
## $ tag <fct> (1-2), (5-6), (7-8), (6-7), (6-7), (4-5), (8-…
Now it is time to create the plot. Using ggplot, I assign my tag (danceability bin) to the x-axis and the track popularity to the y-axis. Because there are still thousands of data points in each bin, I use geom_jitter() to add a small amount of noise so that overlapping points are easier to see. I then add my boxplot layer with geom_boxplot(), choosing my fill and outline colours. The ‘alpha’ argument changes the transparency of the element. I want the boxplots to be more visible than the individual data points, so I give them a higher alpha level. I then assign axis labels using the labs() function. Because color = 'blue' sits inside aes(), ggplot treats it as a data value, draws the points in its default colour rather than blue, and adds a legend for it; I remove that useless legend with guides() (guides(colour = "none") in current ggplot2, where colour = FALSE is deprecated). Finally, I add the APA theme using theme_apa().
ggplot(data = dance, mapping = aes(x = tag, y = spotify_track_popularity)) +
geom_jitter(aes(color = 'blue'),
alpha = 0.1) +
geom_boxplot(fill = "bisque",
color = "black",
alpha = 0.7) +
labs(x = 'Track Danceability (x10)',
y = "Track Popularity") +
guides(color = "none") +
theme_apa()

This plot is much nicer to look at. Popularity appears to increase slightly, on average, as danceability increases. However, the box plots reveal that popularity ranges between both extremes at each level of danceability. Interestingly, very few songs have almost no danceability. If you want to download the raw Rmd file for this doc, click the ‘code’ button at the top right of the document.
---
title: "TTWK38 - Billboards"
date: 15-09-2021
output: 
  html_document:
    toc_float: true
    theme: cosmo
    toc: true
    code_download: true
---

# Setup Packages

```{r setup, include=FALSE}

knitr::opts_chunk$set(
	echo = TRUE,
	message = FALSE,
	warning = FALSE
)
```

```{r message=FALSE, warning=FALSE}
library(tidyverse) # For data wrangling purposes and ggplot
library(tidytuesdayR) # For TT data
library(papaja) # For APA theme
```

# Load the Data

Download the weekly data and make it available in the `tt` object.

```{r Load, message=FALSE, warning=FALSE, results='hide'}
# Extract data using instructions from TT readme file
tt <- tt_load("2021-09-14")

# Assign each data set to a new data frame
billboard <- tt$billboard
audio_features <- tt$audio_features
```

# Wrangle

Select a subset of interesting variables. In this case, I want to know if song  **danceability** is associated with song **popularity**. I can select the appropriate variables from the data set using the `select()` function and remove all rows containing NAs in either danceability or popularity by using `drop_na()`. 

```{r Wrangle, message=FALSE, warning=FALSE}
# From audio_features data set, select interesting variables and drop the rows which contain NAs for danceability and popularity.

audio <- audio_features %>%
  select(song, performer, spotify_genre, danceability, spotify_track_popularity) %>%
  drop_na(danceability, spotify_track_popularity)

glimpse(audio)
```
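
As a small illustration of what `drop_na()` does when it is given column names, here is a sketch on a made-up data frame. The `toy` data and its values are assumptions for demonstration only, not rows from the Tidy Tuesday data.

```r
library(tidyr)

# Toy data, invented for illustration (not the Tidy Tuesday data)
toy <- data.frame(
  song = c("A", "B", "C", "D"),
  danceability = c(0.5, NA, 0.7, 0.2),
  spotify_track_popularity = c(40, 55, NA, 10),
  spotify_genre = c(NA, "pop", "rock", "jazz"),
  stringsAsFactors = FALSE
)

# Only rows with an NA in the named columns are dropped.
# Song "A" has an NA in spotify_genre, but that column is not named,
# so the row is kept.
toy_clean <- drop_na(toy, danceability, spotify_track_popularity)

toy_clean$song
```

Calling `drop_na()` with no column names would instead drop every row that has an NA anywhere, which is why the columns are listed explicitly.
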
# Visualise

I want to visualise the relationship between two numeric variables, so a scatterplot would be the natural choice. However, with over 24,000 individual data points, a plain scatterplot would be messy and uninformative. Instead of `geom_point()`, I use `stat_binhex()`, which bins the data points into hexagons and colours each hexagon along a gradient according to how many data points it contains. I set the gradient colours with `scale_fill_gradient()`, choosing the 'low' and 'high' colours from the [ggplot colour cheat-sheet](http://sape.inf.usi.ch/quick-reference/ggplot2/colour). I also apply the `theme_apa()` layer from the **papaja** package to make the plot APA style (-ish).

```{r Visualize}
dance_plot <- ggplot(audio, mapping = aes(x = danceability, y = spotify_track_popularity)) +
  stat_binhex(alpha = 0.8) +
  scale_fill_gradient(low = "navyblue", high = "red2") +
  theme_apa() +
  ylab("Track Popularity") +
  xlab("Track Danceability")

print(dance_plot)
```
  
  
This plot is okay but I think I can do better. 

# Re-Visualise 

I will try a series of box plots showing the popularity of songs that fall into different bins of danceability. First, I create a character vector containing the labels of my bins. Whilst danceability in this data set is coded from 0 to 1, I label the bins on a 0-10 scale because lots of decimal values would just crowd the x-axis of the plot.


```{r message=FALSE, warning=FALSE}
tags <- c("(0-1)","(1-2)", "(2-3)", "(3-4)", "(4-5)", 
          "(5-6)","(6-7)", "(7-8)","(8-9)", "(9-10)")
```


After creating my tags, it is time to create a new data frame containing all of my relevant variables. First, I select danceability and popularity and put them together in a new data frame called **dance**. Then, I use the `mutate()` function to create a new variable called 'tag'. The `case_when()` function effectively says *"when danceability falls between x and y, assign it the nth tag from my object **tags**"*, which I created above. For example, in the first case, when danceability is less than 0.1, the row is assigned the first tag in the vector ("(0-1)"). If danceability is at least 0.1 but less than 0.2, it is assigned the second tag ("(1-2)"). This sorts my danceability scores into 10 bins.


```{r message=FALSE, warning=FALSE}
dance <- audio %>%
  select(danceability, spotify_track_popularity)

dance <- as_tibble(dance) %>% 
  mutate(tag = case_when(
    danceability < 0.1 ~ tags[1],
    danceability >= 0.1 & danceability < 0.2 ~ tags[2],
    danceability >= 0.2 & danceability < 0.3 ~ tags[3],
    danceability >= 0.3 & danceability < 0.4 ~ tags[4],
    danceability >= 0.4 & danceability < 0.5 ~ tags[5],
    danceability >= 0.5 & danceability < 0.6 ~ tags[6],
    danceability >= 0.6 & danceability < 0.7 ~ tags[7],
    danceability >= 0.7 & danceability < 0.8 ~ tags[8],
    danceability >= 0.8 & danceability < 0.9 ~ tags[9],
    danceability >= 0.9 & danceability <= 1.0 ~ tags[10])) # <= so a score of exactly 1 is binned rather than left as NA
```
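
For comparison, the same ten bins can be sketched more compactly with base R's `cut()`. This is an alternative to the `case_when()` approach above, not what the post uses; the helper name `bin_danceability()` and the example scores are made up for illustration.

```r
# Bin labels, as defined earlier in the post
tags <- c("(0-1)", "(1-2)", "(2-3)", "(3-4)", "(4-5)",
          "(5-6)", "(6-7)", "(7-8)", "(8-9)", "(9-10)")

# Hypothetical helper: bin scores in [0, 1] into ten left-closed intervals.
# right = FALSE gives [0, 0.1), [0.1, 0.2), ...; include.lowest = TRUE
# closes the final interval so that a score of exactly 1 lands in "(9-10)".
bin_danceability <- function(x) {
  cut(x,
      breaks = (0:10) / 10,
      labels = tags,
      right = FALSE,
      include.lowest = TRUE)
}

# Made-up scores for illustration
scores <- c(0.05, 0.154, 0.588, 0.95, 1)
binned <- bin_danceability(scores)

binned
```

`cut()` returns a factor whose levels are already in the order of `labels`, so no separate `factor()` call would be needed with this approach.
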


Right now the **tag** variable is a character variable. I must transform it into a factor before using it to create multiple boxplots. This is done using the `factor()` function. Instead of writing out each level individually, I pass the tags I created earlier as the levels, which also fixes the order in which the bins appear on the x-axis. Setting 'ordered = FALSE' (the default) simply keeps it a regular, unordered factor. Notice each danceability score now has a corresponding tag.


```{r message=FALSE, warning=FALSE}
dance$tag <- factor(dance$tag,
                          levels = tags,
                          ordered = FALSE)

glimpse(dance)
```
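
To see how `factor()` decides the order of its levels, here is a minimal sketch. The labels `"low"`, `"mid"` and `"high"` are hypothetical and chosen only because their alphabetical order differs from their natural order.

```r
# Hypothetical labels, invented for illustration
labels_in_data <- c("mid", "high", "low", "mid")

# Without `levels`, factor() sorts the unique values alphabetically
default_levels <- levels(factor(labels_in_data))

# With `levels`, the order is exactly the one supplied
explicit <- factor(labels_in_data,
                   levels = c("low", "mid", "high"),
                   ordered = FALSE)

default_levels
levels(explicit)
is.ordered(explicit)
```

The level order comes from the `levels` argument, while `ordered` only controls whether R treats the factor as ordinal (for example, whether `<` comparisons between levels are allowed).
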
  
  
Now it is time to create the plot. Using ggplot, I assign my tag (danceability bin) to the x-axis and the track popularity to the y-axis. Because there are still thousands of data points in each bin, I use `geom_jitter()` to add a small amount of noise so that overlapping points are easier to see. I then add my boxplot layer with `geom_boxplot()`, choosing my fill and outline colours. The 'alpha' argument changes the transparency of the element. I want the boxplots to be more visible than the individual data points, so I give them a higher alpha level. I then assign axis labels using the `labs()` function. Because `color = 'blue'` sits inside `aes()`, ggplot treats it as a data value, draws the points in its default colour rather than blue, and adds a legend for it; I remove that useless legend with `guides()` (`guides(colour = "none")` in current ggplot2, where `colour = FALSE` is deprecated). Finally, I add the APA theme using `theme_apa()`.
  
  
```{r message=FALSE, warning=FALSE}
ggplot(data = dance, mapping = aes(x = tag, y = spotify_track_popularity)) + 
  geom_jitter(aes(color = 'blue'), 
              alpha = 0.1) +
  geom_boxplot(fill = "bisque", 
               color = "black", 
               alpha = 0.7) + 
  labs(x = 'Track Danceability (x10)', 
       y = "Track Popularity") +
  guides(color = "none") +
  theme_apa()
```

This plot is much nicer to look at. Popularity appears to increase slightly, on average, as danceability increases. However, the box plots reveal that popularity ranges between both extremes at each level of danceability. Interestingly, very few songs have almost no danceability. If you want to download the raw Rmd file for this doc, click the 'code' button at the top right of the document.
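
One way to put numbers on the visual impression from the box plots is to compute the median popularity per bin and a rank correlation. The sketch below uses a tiny made-up data frame, `toy_dance`, with the same column names as **dance**; its values are invented and say nothing about the real data.

```r
# Toy data, invented for illustration, shaped like the `dance` data frame
toy_dance <- data.frame(
  tag = factor(c("(0-1)", "(0-1)", "(0-1)", "(5-6)", "(5-6)", "(5-6)"),
               levels = c("(0-1)", "(5-6)")),
  danceability = c(0.02, 0.05, 0.08, 0.52, 0.55, 0.58),
  spotify_track_popularity = c(10, 20, 60, 30, 50, 70)
)

# Median popularity within each danceability bin
bin_medians <- aggregate(spotify_track_popularity ~ tag,
                         data = toy_dance,
                         FUN = median)

# Spearman rank correlation between danceability and popularity
rho <- cor(toy_dance$danceability,
           toy_dance$spotify_track_popularity,
           method = "spearman")

bin_medians
rho
```

Running the same two steps on **dance** would show whether the bin medians actually rise with danceability and how strong the overall monotonic association is.
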
