Today, we’ll work on just one (or 1.5) topic(s).
In labs to date, I’ve been very irresponsible in my pedagogical approach, using a ton of dplyr
verbs without going through this technology. That’s rectified today.
Then, we’ll touch on simple example of reshaping.
First, we’ll download some data on the Billboard Hot 100, gathering all chart locations for popular music in the United States, between June 1958
If you’d like to be sure you have the right libraries, first install them as follows.
install.packages(c("plyr", "tidyverse", "magrittr", "lubridate"))
(No need to install these packages if you’ve successfully used them on your machine previously.)
Now we’ll load the packages, and the data:
library(plyr)
library(tidyverse)
library(magrittr)
library(lubridate)
d1 <- "https://github.com/thomasjwood/ps4160/raw/master/billboard_58_21.rds" %>%
url %>%
gzcon %>%
readRDS
To make sure everything loaded ok
d1$week_id %>% year %>% table
Ok–dplyr
The dplyr
verbs
dplyr
is principally a psychological innovation. dplyr
is an attempt to abstract away the computational details of what you intend to do with some data matrix, and simply express your intention in the most concise possible form. Do you want to
• Look at only some specific cases (ie, what are is the typical chart performance for songs released in the summer time, among songs with high spotify-scored ‘danceability’ scores?)?
• Compute a summary statistic for some group of cases (ie, compare total weeks in the charts for songs that debut in the top 10, and those songs which debuted outside of the top-10?)
• Sort rows, by some characteristic inside the data set (ie, by total weeks spent in the top 100?)
While there are a million competitors which do precisely this–SQL, Pandas, Spark, data.table, etc… but for people with minimal technical background, I think this is the most elegant, tractable approach.
Here are the 5 major dplyr
verbs:

Here is a gratuitous textual recapitulation:
filter
You pass a set of logical conditions, and filter
only returns the rows which match
select
You pass a set of logical conditions, and select
returns only those columns which match
mutate
You generate a new variable, often through computation
summarize
You replace separate dis-aggregated rows with a summary row, where each row summarizes (aggregates) the separate data in the old table.
arrange
You sort rows
Omitted here is the equally useful group_by
, which allows you to perform separate calculations for each group of rows.
Maybe some of these processes will be more tractable through some simple examples?
filter
examples
Let’s do some filtering. If we want to only look at songs which hit #1:
d1 %>%
filter(
week_position == 1
)
Or songs that debuted at #1
d1 %>%
filter(
week_position == 1 &
weeks_on_chart == 1
)
Or songs that debuted at #1 this year
d1 %>%
filter(
week_position == 1 &
weeks_on_chart == 1 &
week_id %>%
year == 2021
)
Or songs that debuted at #1 in 2021, 2020, or2019
d1 %>%
filter(
week_position == 1 &
weeks_on_chart == 1 &
(
week_id %>%
year == 2021 |
week_id %>%
year == 2020 |
week_id %>%
year == 2019
)
)
So this vertical bar operator |
is used as the logical operator “or”.
In the above, the parentheses are very important, since they string together:
the requirement to have the week_position
equal 1, AND
the requirement to have the week_id equal 1, AND
the year equal 2021, OR
the year equal 2020, OR
the year equal 2019
select
examples
This is used to select
only a small number of columns. This is a very useful utility for when we join between tables, and for reshaping.
We can select
using bare variable names
d1 %>%
select(
week_id, week_position, song, performer
)
Or we can select
using values which correspond to column positions
d1 %>%
select(
2:5, 20:22
)
The range operator :
works for names too:
d1 %>%
select(
week_id:song, danceability:loudness
)
There are also very important convenience functions, which allow you to make selections when there’s a pattern of names you’d like to select. Examples include contains()
, ends_with()
, starts_with(),
or everything()
.
d1 %>%
select(
starts_with("week_"),
ends_with("_position")
)
summarize
examples
Imagine we want to know how many artists have had top ten debut songs, by year. We’re not interested (for the moment) in all the specific variation encapsulated inside each separate row. Instead, we just want a single numeric summary for each year group.
d1 %>%
filter(
weeks_on_chart %>%
equals(1) &
week_position <= 10
) %>%
group_by(
week_id %>% year
) %>%
summarize(
num_performers = performer %>%
unique %>%
length
)
Or imagine we want to know the songs which spent the most weeks on the chart
d1 %>%
group_by(performer, song) %>%
summarize(
week_id = week_id %>% min,
tote_weeks = weeks_on_chart %>% max
)
Which should return
performer song week_id tote_weeks
<chr> <chr> <date> <dbl>
1 'N Sync (God Must Have Spent) A Little More Time On You 1998-12-05 22
2 'N Sync Bye Bye Bye 2000-01-29 23
3 'N Sync Gone 2001-09-22 24
4 'N Sync I Drive Myself Crazy 1999-04-24 12
5 'N Sync I Want You Back 1998-03-07 24
6 'N Sync It's Gonna Be Me 2000-05-06 25
7 'N Sync Pop 2001-06-02 15
8 'N Sync Tearin' Up My Heart 1998-12-05 1
9 'N Sync This I Promise You 2000-09-30 26
10 'N Sync & Gloria Estefan Music Of My Heart 1999-09-04 20
Oh no–this table is not very nicely sorted. Instead, we need to do:
d1 %>%
group_by(performer, song) %>%
summarize(
week_id = week_id %>% min,
tote_weeks = weeks_on_chart %>% max
) %>%
arrange(
desc(tote_weeks)
)
which is
performer song week_id tote_weeks
<chr> <chr> <date> <dbl>
1 Imagine Dragons Radioactive 2012-08-18 87
2 AWOLNATION Sail 2011-09-03 79
3 Jason Mraz I'm Yours 2008-05-03 76
4 The Weeknd Blinding Lights 2019-12-14 76
5 LeAnn Rimes How Do I Live 1997-06-21 69
6 LMFAO Featuring Lauren Bennett & GoonRock Party Rock Anthem 2011-02-12 68
7 OneRepublic Counting Stars 2013-07-06 68
8 Adele Rolling In The Deep 2010-12-25 65
9 Jewel Foolish Games/You Were Meant For Me 1996-11-30 65
10 Carrie Underwood Before He Cheats 2006-09-16 64
arrange
examples
Our final verb provides the ability to sort rows (we used it above to get a sense of the songs which persisted the longest on the Billboard 100.) For instance, which are the most “danceable” songs?
d1 %>%
group_by(song_id) %>%
slice(1) %>%
ungroup %>%
arrange(desc(danceability)) %>%
select(week_id:performer)
should return
week_id week_position song performer
<date> <dbl> <chr> <chr>
1 1989-03-04 60 Funky Cold Medina Tone-Loc
2 2007-12-15 92 Go Girl Pitbull Featuring Trina & Young Bo$$
3 2017-03-04 88 Cash Me Outside (CashMeOutside) DJ Suede The Remix God
4 2010-05-22 74 Ice Ice Baby Glee Cast
5 1984-06-30 30 State of Shock The Jacksons
6 1990-09-08 72 Ice Ice Baby Vanilla Ice
7 2019-08-17 82 Uno Ambjaay
8 2019-08-31 32 Bad Bad Bad Young Thug Featuring Lil Baby
9 1983-01-15 90 Betcha She Don't Love You Evelyn King
10 2002-07-13 95 In Da Wind Trick Daddy Featuring Cee-Lo & Big Boi
# ... with 29,379 more rows
pivot_longer
and pivot_wider
examples
This is a tough one to communicate to students. The principle things to keep in mind about reshaping
No data are introduced or omitted. Every pair of cell coordinate–variable names and the ID vector in the example below, remain unchanged in each form of the data–whether it’s wide (on the left) or wide (on the right.)
The most useful typical example of this process is when you want to do a bunch of calculations on separate variables. Now, you might be inclined/equipped to write a clever for loop, which operates on each separate variable. But if you don’t want to learn how to do this, instead, just encode the difference variable names, themselves, in a variable. That’s what we’re doing in the example below.

Let’s think of an example–which of the song attributes are most strongly related to a song’s peak position?
d1 %>%
group_by(song_id) %>%
slice(
peak_position %>%
which.min
) %>%
select(
song, performer, song_id, peak_position,
danceability:tempo
) %>%
pivot_longer(
cols = danceability:tempo,
names_to = "attr",
values_to = "val",
values_drop_na = T
) %>%
group_by(
attr
) %>%
summarise(
rel = cor(peak_position, val)
) %>%
arrange(
desc(rel)
)
pivot_longer
is saying:
apply the pivot longer mechanism to the variables between danceability
and tempo
call my vector of variable names “attr
”
call my vector of variable values “val
”
Then we compute the correlation for each gathered variable, comparing that variable to the overall margin
Gosh how did I make a whole set of lab notes without including a single plot!?

Let this be a lesson to us all – de-emphasize speechiness to maximize the Billboard performance of your next single.
---
title: "PS 4160 Lab 3"
author: "Thomas Wood, Political Science--Ohio State, wood.1080@osu.edu"
output:
  html_notebook:
    theme: cerulean
  html_document:
    df_print: paged
---

Today, we'll work on just one (or 1.5) topic(s).

1.  In labs to date, I've been very irresponsible in my pedagogical approach, using a ton of `dplyr` verbs without going through this technology. That's rectified today.

2.  Then, we'll touch on simple example of reshaping.

First, we'll download some data on the [Billboard Hot 100](https://www.billboard.com/charts/hot-100/), gathering all chart locations for popular music in the United States, between June 1958

If you'd like to be sure you have the right libraries, first install them as follows.

``` {#install_packages .r}
install.packages(c("plyr", "tidyverse", "magrittr", "lubridate"))
```

(No need to install these packages if you've successfully used them on your machine previously.)\

Now we'll load the packages, and the data:

``` {#load_data .r .R}
library(plyr)
library(tidyverse)
library(magrittr)
library(lubridate)

d1 <- "https://github.com/thomasjwood/ps4160/raw/master/billboard_58_21.rds" %>%
  url %>%
  gzcon %>%
  readRDS 
```

To make sure everything loaded ok

``` r
d1$week_id %>% year %>% table
```

Ok--`dplyr`

## The `dplyr` verbs

`dplyr` is principally a psychological innovation. `dplyr` is an attempt to abstract away the computational details of what you intend to do with some data matrix, and simply express your intention in the most concise possible form. Do you want to

• Look at only some specific cases (ie, what are is the typical chart performance for songs released in the summer time, among songs with high spotify-scored '*danceability*' scores?)?

• Compute a summary statistic for some *group* of cases (ie, compare total weeks in the charts for songs that debut in the top 10, and those songs which debuted outside of the top-10?)

• Sort rows, by some characteristic inside the data set (ie, by total weeks spent in the top 100?)

While there are a million competitors which do precisely this--SQL, Pandas, Spark, data.table, etc... but for people with minimal technical background, I think this is the most elegant, tractable approach.

Here are the 5 major `dplyr` verbs:

![](five_verb_diag.jpg)

Here is a gratuitous textual recapitulation:

**`filter`** You pass a set of *logical conditions*, and **`filter`** only returns the rows which match

**`select`** You pass a set of *logical conditions*, and **`select`** returns only those columns which match

**`mutate`** You generate a new variable, often through computation

**`summarize`** You replace separate dis-aggregated rows with a summary row, where each row summarizes (aggregates) the separate data in the old table.

**`arrange`** You sort rows

Omitted here is the equally useful **`group_by`** , which allows you to perform separate calculations for each group of rows.

Maybe some of these processes will be more tractable through some simple examples?

## `filter` examples

Let's do some filtering. If we want to only look at songs which hit #1:

``` r
d1 %>% 
  filter(
    week_position == 1
  )
```

Or songs that debuted at #1

``` r
d1 %>% 
  filter(
    week_position == 1 &
    weeks_on_chart == 1
  )
```

Or songs that debuted at #1 this year

``` r
d1 %>% 
  filter(
    week_position == 1 &
    weeks_on_chart == 1 &
    week_id %>% 
      year == 2021
  )
```

Or songs that debuted at #1 in 2021, 2020, ***or***2019

``` r
d1 %>% 
  filter(
    week_position == 1 &
    weeks_on_chart == 1 &
      (
        week_id %>%
          year == 2021 |
        week_id %>%
          year == 2020 |
        week_id %>%
          year == 2019
      )
    )
```

So this vertical bar operator `|` is used as the logical operator "*or*".

In the above, the parentheses are very important, since they string together:

-   the requirement to have the `week_position` equal 1, **AND**

-   the requirement to have the **week_id** equal 1, **AND**

    -   the year equal 2021, **OR**

    -   the year equal 2020, **OR**

    -   the year equal 2019

## `select` examples

This is used to `select` only a small number of columns. This is a very useful utility for when we join between tables, and for reshaping.

We can `select` using bare variable names

``` r
d1 %>% 
  select(
    week_id, week_position, song, performer
  )
```

Or we can `select` using values which correspond to column positions

``` r
d1 %>% 
  select(
    2:5, 20:22
  )
```

The range operator `:` works for names too:

``` r
d1 %>% 
  select(
    week_id:song, danceability:loudness
    )
```

There are also very important *convenience functions*, which allow you to make selections when there's a *pattern* of names you'd like to select. Examples include `contains()`, `ends_with()`, `starts_with(),` or `everything()`.

``` r
d1 %>% 
  select(
    starts_with("week_"), 
    ends_with("_position")
    )
```

## `summarize` examples

Imagine we want to know how many artists have had top ten debut songs, by year. We're not interested (for the moment) in all the specific variation encapsulated inside each separate row. Instead, we just want a single numeric summary for each year group.

``` r
d1 %>% 
  filter(
    weeks_on_chart %>% 
      equals(1) & 
    week_position <= 10
    ) %>% 
  group_by(
    week_id %>% year
    ) %>% 
  summarize(
    num_performers = performer %>% 
      unique %>% 
      length
  )
```

Or imagine we want to know the songs which spent the most weeks on the chart

``` r
d1 %>% 
  group_by(performer, song) %>% 
  summarize(
    week_id = week_id %>% min,
    tote_weeks = weeks_on_chart %>% max
    )
```

Which should return

       performer                song                                            week_id    tote_weeks
       <chr>                    <chr>                                           <date>          <dbl>
     1 'N Sync                  (God Must Have Spent) A Little More Time On You 1998-12-05         22
     2 'N Sync                  Bye Bye Bye                                     2000-01-29         23
     3 'N Sync                  Gone                                            2001-09-22         24
     4 'N Sync                  I Drive Myself Crazy                            1999-04-24         12
     5 'N Sync                  I Want You Back                                 1998-03-07         24
     6 'N Sync                  It's Gonna Be Me                                2000-05-06         25
     7 'N Sync                  Pop                                             2001-06-02         15
     8 'N Sync                  Tearin' Up My Heart                             1998-12-05          1
     9 'N Sync                  This I Promise You                              2000-09-30         26
    10 'N Sync & Gloria Estefan Music Of My Heart                               1999-09-04         20

Oh no--this table is not very nicely sorted. Instead, we need to do:

``` r
d1 %>% 
  group_by(performer, song) %>% 
  summarize(
    week_id = week_id %>% min,
    tote_weeks = weeks_on_chart %>% max
    ) %>% 
  arrange(
    desc(tote_weeks)
    )
```

which is

         performer                                 song                                week_id    tote_weeks
       <chr>                                     <chr>                               <date>          <dbl>
     1 Imagine Dragons                           Radioactive                         2012-08-18         87
     2 AWOLNATION                                Sail                                2011-09-03         79
     3 Jason Mraz                                I'm Yours                           2008-05-03         76
     4 The Weeknd                                Blinding Lights                     2019-12-14         76
     5 LeAnn Rimes                               How Do I Live                       1997-06-21         69
     6 LMFAO Featuring Lauren Bennett & GoonRock Party Rock Anthem                   2011-02-12         68
     7 OneRepublic                               Counting Stars                      2013-07-06         68
     8 Adele                                     Rolling In The Deep                 2010-12-25         65
     9 Jewel                                     Foolish Games/You Were Meant For Me 1996-11-30         65
    10 Carrie Underwood                          Before He Cheats                    2006-09-16         64

## `arrange`  examples

Our final verb provides the ability to sort rows (we used it above to get a sense of the songs which persisted the longest on the Billboard 100.) For instance, which are the most "danceable" songs?

``` r
d1 %>% 
  group_by(song_id) %>% 
  slice(1) %>% 
  ungroup %>% 
  arrange(desc(danceability)) %>% 
  select(week_id:performer)
```

should return

       week_id    week_position song                             performer                             
       <date>             <dbl> <chr>                            <chr>                                 
     1 1989-03-04            60 Funky Cold Medina                Tone-Loc                              
     2 2007-12-15            92 Go Girl                          Pitbull Featuring Trina & Young Bo$$  
     3 2017-03-04            88 Cash Me Outside (CashMeOutside)  DJ Suede The Remix God                
     4 2010-05-22            74 Ice Ice Baby                     Glee Cast                             
     5 1984-06-30            30 State of Shock                   The Jacksons                          
     6 1990-09-08            72 Ice Ice Baby                     Vanilla Ice                           
     7 2019-08-17            82 Uno                              Ambjaay                               
     8 2019-08-31            32 Bad Bad Bad                      Young Thug Featuring Lil Baby         
     9 1983-01-15            90 Betcha She Don't Love You        Evelyn King                           
    10 2002-07-13            95 In Da Wind                       Trick Daddy Featuring Cee-Lo & Big Boi
    # ... with 29,379 more rows

## `pivot_longer` and `pivot_wider` examples

This is a tough one to communicate to students. The principle things to keep in mind about reshaping

-   No data are introduced or omitted. Every pair of cell coordinate--variable names and the ID vector in the example below, remain unchanged in each form of the data--whether it's wide (on the left) or wide (on the right.)

-   The most useful typical example of this process is when you want to do a bunch of calculations on separate variables. Now, you might be inclined/equipped to write a clever for loop, which operates on each separate variable. But if you don't want to learn how to do this, instead, **just encode the difference variable names, themselves, in a variable.** That's what we're doing in the example below.

![](tidyr_diag.jpg){width="700"}

Let's think of an example--which of the song attributes are most strongly related to a song's peak position?

``` r
d1 %>% 
  group_by(song_id) %>% 
  slice(
    peak_position %>% 
      which.min
    ) %>% 
  select(
    song, performer, song_id, peak_position, 
    danceability:tempo
    ) %>% 
  pivot_longer(
    cols = danceability:tempo, 
    names_to = "attr", 
    values_to = "val", 
    values_drop_na = T
    ) %>% 
  group_by(
    attr
  ) %>% 
  summarise(
    rel = cor(peak_position, val)
    ) %>% 
  arrange(
    desc(rel)
  )
```

`pivot_longer` is saying:

-   apply the pivot longer mechanism to the variables between `danceability` and `tempo`

-   call my vector of variable names "`attr`"

-   call my vector of variable values "`val`"

Then we compute the correlation for each gathered variable, comparing that variable to the overall margin

Gosh how did I make a whole set of lab notes without including a single plot!?

![](position_cor.jpeg){width="700"}

Let this be a lesson to us all -- *de-emphasize* speechiness to maximize the Billboard performance of your next single.
