dplyr
The following is a sample analysis of a small dataset of box-office receipts from the old Burgtheater in Vienna in 1791. The dataset covers 60 opera performances from the beginning of the new theatrical season on 26 Apr 1791 (the Tuesday after Easter), to Tue, 4 Oct 1791. I’ve chosen this end point because the following day an anonymous correspondent in Vienna sent a gossipy report to the Bayreuther Zeitung about the current Viennese theatrical scene. The correspondent claims that the Viennese public had become bored with recent offerings in the Burgtheater (one of the two theaters administered by the imperial court), and seemed to prefer the latest offerings in Vienna’s suburban theaters. One of those new offerings was Mozart’s Die Zauberflöte (The Magic Flute), which was premiered in the Theater auf der Wieden just five days earlier, on 30 Sep; in fact, the report in the Bayreuther Zeitung is one of the earliest known references to Mozart’s opera in print.1 My simple statistical analysis here draws on the surviving records of the box-office receipts from the Burghteater to evaluate the correspondent’s implication that attendance at the theater had declined because the public was bored with its offerings.2
My analysis uses the R language with the add-on packages dplyr
, lubridate
, and ggplot2
. This document was prepared in RStudio using R Markdown and knitr
. The document is not intended as an introduction to R (there are many good ones available!), but I have tried to give clear explanations of basic concepts that may not be familiar to scholars in the humanities who are seeing an analysis of this sort for the first time. It is my hope that this document will begin to show that simple statistical analyses, requiring no sophisticated statistical background and using free and open-source tools that are easy to install and use, can produce useful insights from datasets like the one analyzed here.
I began by entering the data on the box-office receipts into a spreadsheet (Google Sheets are convenient for this purpose). In order to keep the demonstration here relatively uncomplicated, the data consists of just four columns (or variables): ‘Date’ (in the form ‘1791-04-26’), ‘Title’, ‘fl’ (the abbreviation for gulden or “florin”), and ‘kr’ (the abbreviation for kreuzer; gulden and kreuzer were the monetary units in which accounts were kept at that time). We’ll assume that this data has already been exported from the spreadsheet to a csv
(comma-separated values) file, a widely used format for importing and exporting tabular data. The data in the csv
file looks like this:
Date,Title,fl,kr
1791-04-26,La bella pescatrice,483,10
1791-04-28,La bella pescatrice,418,42
1791-04-30,La bella pescatrice,325,11
1791-05-02,Il talismano,84,41
1791-05-04,La bella pescatrice,397,48
1791-05-06,Il talismano,81,6
1791-05-08,La bella pescatrice,479,34
...
The file can be downloaded from Google Drive at the following link, if you’d like to follow along with the analysis: Burgtheater_Receipts.csv
When you click on the link, Google may open the document in a spreadsheet. To download the file in that case, navigate through the menu to:
File> Download as > Comma-separated values (.csv, current sheet)
.
First we’ll use the R function library()
to load the dplyr
, lubridate
, and ggplot2
packages, which we’ll be using later on. We then import Burgtheater_Receipts.csv
into an R “dataframe” named receipts
, using the function read.table()
.3 (We’ll assume that the csv
file is already in our current working directory, so the full pathname can be omitted.) The symbol <-
is R’s assignment operator.
suppressMessages(library(dplyr))
library(lubridate)
library(ggplot2)
receipts <- read.table("Burgtheater_Receipts.csv", header=TRUE, sep=",")
This is less intimidating than it may look to a beginner. Loading packages into R is simple and painless, especially if you are using RStudio. I’ve used the suppressMessages()
function when loading dplyr
, because that package produces several messages when loading that are not important to us here. The argument header=TRUE
in the function read.table()
tells R that the first row of the file contains the column names, which we want to keep. The argument sep=","
tells R that a comma is used to separate the data items in each row of the file.
This creates an R dataframe
named receipts
. We can look at the first few rows of receipts
using the head()
function from base R.4
head(receipts)
## Date Title fl kr
## 1 1791-04-26 La bella pescatrice 483 10
## 2 1791-04-28 La bella pescatrice 418 42
## 3 1791-04-30 La bella pescatrice 325 11
## 4 1791-05-02 Il talismano 84 41
## 5 1791-05-04 La bella pescatrice 397 48
## 6 1791-05-06 Il talismano 81 6
The function by default shows the first six rows, but that number can be set to anything you like.
We can use tail()
to look at the last few rows of receipts
:
tail(receipts)
## Date Title fl kr
## 55 1791-09-24 I zingari in fiera 180 54
## 56 1791-09-26 La bella pescatrice 218 21
## 57 1791-09-27 La molinara 239 33
## 58 1791-09-30 I zingari in fiera 162 18
## 59 1791-10-02 I zingari in fiera 238 32
## 60 1791-10-04 Il pazzo per forza 118 30
We can examine the structure of the dataframe using the function str()
, also from base R. This is often a good first step in working with a dataframe in R, to make sure everything is as you expect.
str(receipts)
## 'data.frame': 60 obs. of 4 variables:
## $ Date : Factor w/ 60 levels "1791-04-26","1791-04-28",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Title: Factor w/ 7 levels "I zingari in fiera",..: 4 4 4 3 4 3 4 4 4 5 ...
## $ fl : int 483 418 325 84 397 81 479 251 174 467 ...
## $ kr : int 10 42 11 41 48 6 34 58 49 42 ...
This tells us that the receipts
dataframe has 60 “observations” (that is, 60 rows), each corresponding to a performance on a particular date. Information about each of the four columns (or “variables”) in the dataframe is listed horizontally, with the column name following the $
symbol. (When we want to refer to a particular column in our dataframe, we’ll use the syntax receipts$Title
, meaning “the Title
column in the receipts
dataframe.”) Following the name of the column is the class of the data in that column. The columns fl
and kr
(gulden and kreuzer) are of class int
, or integer, as we would expect. Note that R doesn’t yet recognize the entries in the Date
column as dates, instead seeing them as Factors
. To convert the class of the Date
column, we use the as.Date()
function:
receipts$Date <- as.Date(receipts$Date)
str(receipts)
## 'data.frame': 60 obs. of 4 variables:
## $ Date : Date, format: "1791-04-26" "1791-04-28" ...
## $ Title: Factor w/ 7 levels "I zingari in fiera",..: 4 4 4 3 4 3 4 4 4 5 ...
## $ fl : int 483 418 325 84 397 81 479 251 174 467 ...
## $ kr : int 10 42 11 41 48 6 34 58 49 42 ...
Now R recognizes the dates as belonging to the Date
class, which is what we want. R has automatically read the entries in the Title
column as being of class Factor
instead of (as we might expect) class character
(in R, what is called a “string” in many programming languages is treated as a vector of individual characters). But in fact this turns out to be a good thing, as having the titles as factors
will make it easier to analyze the operas statistically by title. We see that the Title
column has 7 levels
, which tells us that only seven different operas were performed between 26 Apr and 4 Oct 1791.5
How many weeks was this? As is usually the case with R, there are many ways to answer that question, but for simplicity we’ll stick to a function in base R, difftime()
. The expression to find the number of weeks between our two dates is:
difftime(receipts$Date[60], receipts$Date[1], unit = "weeks")
## Time difference of 23 weeks
receipts$Date[1]
is the first date in the Date
column, namely 1791-04-26 (the number in square brackets is the “index” for that item in the column), and receipts$Date[60]
is the 60th (last) date in the Date
column, 1791-10-04; so the expression subtracts the earlier date from the later one, and gives the answer in weeks
. This already tells us something interesting about the Burgtheater’s orperatic repertoire at the time: only seven different operas were performed across those 23 weeks, so there was not a great deal of turnover or variety in the repertory, and this could be one of the things that the anonymous correspondent to the Bayreuther Zeitung was referring to in implying that the Viennese public was bored with the offerings in the Burgtheater.
We could also have found the interval between the first and last dates directly:
difftime("1791-10-04", "1791-04-26", unit = "weeks")
## Time difference of 23 weeks
What are the seven titles in the dataset? There are several ways to ask R for this information, but because the titles are of class factor
, a convenient one is to write:
levels(receipts$Title)
## [1] "I zingari in fiera" "Il pazzo per forza" "Il talismano"
## [4] "La bella pescatrice" "La molinara" "La pastorella nobile"
## [7] "Le trame deluse"
This expression says: “Show the levels of the factors in the Title
column in the receipts
dataframe.”
Before we begin our analysis of the data, it will be useful to add a new column called Total_kreuzer
to the receipts
dataframe, converting the receipts in gulden and kreuzer to the total number of kreuzer, so that we can calculate the mean (average), and other simple statistics. One gulden was equal to 60 kreuzer, so to calculate the total number of kreuzer in the box-office receipts for a given date, we use the formula Total_kreuzer = fl * 60 + kr
.
The traditional way to do this in base R would be:
receipts$Total_kreuzer <- receipts$fl * 60 + receipts$kr
This will add a new column Total_kreuzer
to the receipts
dataframe, with the total kreuzer calculated for each performance. (Notice that you create the new column simply by giving it a name and assigning something to it.) Because traditional R expressions can quickly become multiply embedded and quite difficult to parse visually, in this document I’ll mostly use functions and vocabulary from the dplyr
package, which will often turn out to be more intuitive and easy to read as our analysis becomes more complex. (For a good introduction to dplyr
emphasizing the clarity of dplyr
expressions over those in base R, see the posts by Kevin Markham here and here.)
As an alternative to the traditional R syntax, we can use the mutate()
function from the dplyr
package to add our new calculated column Total_kreuzer
to the receipts
dataframe. The %>%
operator (sometimes called the “pipe” operator) is used to “chain” or link expressions in sequence.
receipts <- receipts %>% mutate(Total_kreuzer = fl * 60 + kr)
We can read this as follows, starting just to the right of R’s assignment operator, <-
:
receipts
dataframe, create the new column Total_kreuzer
containing the value fl * 60 + kr
, and save the original dataframe with this new column using R’s assignment operator <-
. (If we don’t make the assignment, dplyr
won’t save the new column).We again can use head()
to see if this has done what we want:
head(receipts)
## Date Title fl kr Total_kreuzer
## 1 1791-04-26 La bella pescatrice 483 10 28990
## 2 1791-04-28 La bella pescatrice 418 42 25122
## 3 1791-04-30 La bella pescatrice 325 11 19511
## 4 1791-05-02 Il talismano 84 41 5081
## 5 1791-05-04 La bella pescatrice 397 48 23868
## 6 1791-05-06 Il talismano 81 6 4866
So our dataframe now has five columns or variables.
We can obtain a summary of our box-office data using the function summary()
from base R. This will give us some useless information along with the useful, but summary()
is usually a good first step in any exploratory data analysis, because it will give us an overall sense of what we’ve got in our dataframe:
summary(receipts)
## Date Title fl
## Min. :1791-04-26 I zingari in fiera : 5 Min. : 26.0
## 1st Qu.:1791-06-02 Il pazzo per forza :11 1st Qu.:117.8
## Median :1791-07-27 Il talismano : 3 Median :174.5
## Mean :1791-07-18 La bella pescatrice :20 Mean :203.3
## 3rd Qu.:1791-08-28 La molinara :12 3rd Qu.:235.0
## Max. :1791-10-04 La pastorella nobile: 5 Max. :556.0
## Le trame deluse : 4
## kr Total_kreuzer
## Min. : 2.00 Min. : 1605
## 1st Qu.:16.00 1st Qu.: 7092
## Median :31.00 Median :10514
## Mean :28.68 Mean :12228
## 3rd Qu.:40.25 3rd Qu.:14110
## Max. :58.00 Max. :33381
##
summary()
gives summary statistics for all five columns, making its best guess about what what kinds of statistics to return. Sometimes these guesses aren’t particularly helpful. The “summary” of the Date
column gives us the minimum, first quartile (25th percentile), median, mean, third quartile (75th percentile), and maximum of the dates. This isn’t very useful: the concept of the “mean” of a set of dates isn’t really meaningful, the “minimum” and “maximum” just tell us the first and last dates (which we wouldn’t normally call “minimum” and “maximum”), and the “median” gives us the date 1791-07-27, halfway between the two dates at the center of our data, 1791-07-25 and 1791-07-29. This would be fine, except that no opera was performed on that date, so this “median” doesn’t really tell us anything. It isn’t even the true chronological midpoint, because our dataset doesn’t include all of the dates between its endpoints: R finds the “midpoint” of the sequence of dates we’ve given it, not the true midpoint, which is 15 Jul. The statistics for gulden (fl) and kreuzer (kr) are technically correct, but again not quite what we actually need: we don’t want the means of the gulden and kreuzer columns separately, but rather the mean of the total box-office receipts from each performance—this is why we added the column Total_kreuzer
.
However, the summary of the Title
column is useful: it tells us how many times each opera was performed during our period of interest. We can immediately see that La bella pescatrice, which was premiered in Vienna on 26 Apr 1791 and was the first opera performance in the new season, was the most frequently performed opera over the next few months, given 20 times over 23 weeks. In contrast, Salieri’s Il talismano (which as we’ll see was long past its sell-by date with the Viennese public) was performed only 3 times. The seven operas in chronological order of their first performance during this period are:
The statistics in the column Total_kreuzer
in the summary above are also useful: they tell us that the minimum receipts from any single opera performance during this period were 1605 kr (26 fl 45 kr), the maximum were 33381 kr (556 fl 21 kr), the mean (average) receipts over the 60 performances were 12228 kr (203 fl 48 kr), and the median receipts (that is, the number exactly in the middle of the range of receipts) were 10514 kr (175 fl 14 kr).6
We can create a separate table showing the numbers of performances of each opera using two functions from dplyr
: group_by()
and tally()
:
receipts %>% group_by(Title) %>% tally()
## Source: local data frame [7 x 2]
##
## Title n
## 1 I zingari in fiera 5
## 2 Il pazzo per forza 11
## 3 Il talismano 3
## 4 La bella pescatrice 20
## 5 La molinara 12
## 6 La pastorella nobile 5
## 7 Le trame deluse 4
Notice that tally()
has automatically given the heading n
to the new column of counts (we could change this heading if we wished).
What is particularly nice about the dplyr
syntax using the %>%
operator is that we can easily extend expressions in an intuitive way. We can think of the %>%
operator in the previous expression as meaning “and then”: take the receipt
dataframe, and then group it by title, and then count the numbers in each group.
We can have tally
sort this table simply by setting the attribute sort
to TRUE
(in R, the logical values TRUE
and FALSE
must be given in uppercase):
receipts %>% group_by(Title) %>% tally(sort=TRUE)
## Source: local data frame [7 x 2]
##
## Title n
## 1 La bella pescatrice 20
## 2 La molinara 12
## 3 Il pazzo per forza 11
## 4 I zingari in fiera 5
## 5 La pastorella nobile 5
## 6 Le trame deluse 4
## 7 Il talismano 3
We can get the same results in a slightly different way incorporating the arrange()
function from dplyr
; we’ll often use arrange()
in what follows. Here the n
in the function desc()
is the name of the column created by tally()
:
receipts %>% group_by(Title) %>% tally() %>% arrange(desc(n))
## Source: local data frame [7 x 2]
##
## Title n
## 1 La bella pescatrice 20
## 2 La molinara 12
## 3 Il pazzo per forza 11
## 4 I zingari in fiera 5
## 5 La pastorella nobile 5
## 6 Le trame deluse 4
## 7 Il talismano 3
In base R, we can get the same information (albeit in a different format) by writing:
sort(table(receipts$Title), decreasing=TRUE)
##
## La bella pescatrice La molinara Il pazzo per forza
## 20 12 11
## I zingari in fiera La pastorella nobile Le trame deluse
## 5 5 4
## Il talismano
## 3
This illustrates one of the fundamental characteristics of working with R: there are usually many different ways to accomplish any particular task, and choosing among them is, in part, simply a matter of personal taste. This is both a strength and a weakness: R is incredibly rich and powerful because of this diversity, but that richness can often make it confusing for beginners. There are sometimes so many ways to do something that it can be difficult for a beginner to figure out how to do that thing at all. This diversity can easily be seen on the various R discussion lists, where a simple query about how to do something will often produce several answers that take quite different approaches to doing it. What can seem to a beginner (or even someone at an intermediate level) like R’s “blooming, buzzing confusion” is in stark contrast to the philosophy of a programming language such as Python, with its mantra:
There should be one—and preferably only one—obvious way to do it.7
But we can have it both ways: both R and Python are very handy to have in your toolkit if you’re aspiring to do work under the broad umbrella of the “digital humanities.”
To count the number of opera performances in each month using dplyr
, we can write:
receipts %>%
group_by(Month = month(Date)) %>%
tally()
## Source: local data frame [7 x 2]
##
## Month n
## 1 4 3
## 2 5 11
## 3 6 8
## 4 7 9
## 5 8 15
## 6 9 12
## 7 10 2
This uses the month()
function from the package lubridate
to extract the number of the month. Notice that we’re now indenting the elements of the expression on separate lines to make the sequence easier to read.
If we prefer to see an abbreviation of the month name instead of its number, we can write month(Date, label=TRUE, abbr=TRUE)
:
receipts %>%
group_by(Month = month(Date, label=TRUE, abbr=TRUE)) %>%
tally()
## Source: local data frame [7 x 2]
##
## Month n
## 1 Apr 3
## 2 May 11
## 3 Jun 8
## 4 Jul 9
## 5 Aug 15
## 6 Sep 12
## 7 Oct 2
We expect to see low counts for Apr and Oct, as our dataset covers just the last five days of Apr and the first four of Oct. So let’s exclude those rows from our result, using the filter()
function from dplyr
:
receipts %>%
filter(month(Date) %in% 5:9) %>%
group_by(Month = month(Date, label=TRUE, abbr=TRUE)) %>%
tally()
## Source: local data frame [5 x 2]
##
## Month n
## 1 May 11
## 2 Jun 8
## 3 Jul 9
## 4 Aug 15
## 5 Sep 12
This expression illustrates a useful trick: we tell filter()
to extract the number of the month, and then check whether the number is %in%
the sequence 5 through 9 (May through Sep), thus excluding months 4 and 10 (Apr and Oct).
It looks from this table as if the frequency of opera performances in the Burgtheater was significantly lower in Jun and Jul than in Aug and Sep. We can add a “Percent” column to check whether our intuition is correct. We will take “Percent” here to mean “the percentage of days in a given month on which an opera was performed.”
This is just slightly tricky to do, for a couple of reasons. For one thing, the months have different numbers of days, and we’ll need to account for those differences when we calculate the percentages. Fortunately, lubridate
includes the function days_in_month()
that will return the number of days in a month when called with the month number. In our new expression we will use the function summarize()
from dplyr
instead of tally()
in order to give our column of counts the more informative name “Counts.” We’ll again use mutate()
to add our new “Percent” column to the result. The percentage is calculated by dividing the number of opera performances (Count
) by the number of days in the month, and mulitplying by 100:
receipts %>%
filter(month(Date) %in% 5:9) %>%
group_by(Month = month(Date)) %>%
summarize(Count = n()) %>%
mutate(Percent = Count/days_in_month(Month)*100)
## Source: local data frame [5 x 3]
##
## Month Count Percent
## 1 5 11 35.48387
## 2 6 8 26.66667
## 3 7 9 29.03226
## 4 8 15 48.38710
## 5 9 12 40.00000
We don’t really need to see the percentages to five decimal places, so we will reduce this to two using the round()
function from base R with the attribute digits=2
, embedding our previous formula for calculating Percent
as the argument to round()
:
receipts %>%
filter(month(Date) %in% 5:9) %>%
group_by(Month = month(Date)) %>%
summarize(Count = n()) %>%
mutate(Percent = round(Count/days_in_month(Month)*100, digits=2))
## Source: local data frame [5 x 3]
##
## Month Count Percent
## 1 5 11 35.48
## 2 6 8 26.67
## 3 7 9 29.03
## 4 8 15 48.39
## 5 9 12 40.00
Notice that we’ve gone back to using the month number in the column Month
. It turns out to be slightly tricky to retain the abbreviation for the month using our expression built with dplyr
. We can’t simply keep month(Date, label=TRUE, abbr=TRUE)
in the group_by()
function as we previously did, because the variable Month
still needs to be a number by the time we calculate Percent
inside the mutate()
function at the end of the expression. If Month
is equal to the abbreviation at that point, R won’t be smart enough to convert it back into a number, and will return an error.
So we have to use a second trick, this time using the match()
function from base R, and the built-in vector of month abbreviations month.abb
, also from base R. Now we can return to using month(Date, label=TRUE, abbr=TRUE)
in group_by()
, and use match(Month, month.abb)
to give us the month number again when we need it in the calculation of Percent
.
The expression match(Month, month.abb)
takes the abbreviation that is now in the variable Month
, and finds a match for it in the vector month.abb
. If Month
is, for example, “Aug”, then match(Month, month.abb)
returns 8, because “Aug” is the 8th abbreviation in month.abb
. We then pass that number to days_in_month()
as we did before. It looks superficially more complicated, but when we build it up a step at a time, we can see that it isn’t really—although it may take a little thought to figure out how to do it.
receipts %>%
filter(month(Date) %in% 5:9) %>%
group_by(Month=month(Date, label=TRUE, abbr=TRUE)) %>%
summarize(Count=n()) %>%
mutate(Percent=round(Count/days_in_month(match(Month,month.abb))*100,
digits=2))
## Source: local data frame [5 x 3]
##
## Month Count Percent
## 1 May 11 35.48
## 2 Jun 8 26.67
## 3 Jul 9 29.03
## 4 Aug 15 48.39
## 5 Sep 12 40.00
These percentages show that the frequency of opera performances was indeed noticeably lower in Jun and Jul 1791 than in May, Aug, and Sep. Operas were given only slightly more often than once every four days on average in Jun (around 27% of the days), and just a bit more often than that in Jul.
It might be interesting, then, to calculate the number of days from one opera performance to the next for every pair of consecutive performances during our period, to see if we can spot any unusual gaps when no operas were performed. R and the add-on packages we’re using here make this easy to do. We again use the mutate()
function, this time to add a temporary column called Days
to show the interval in days between each successive pair of dates; we use the diff()
function from base R to calculate those differences. A difference of 2, for example, will show that the opera was performed on the second day after the preceding opera performance (in other words, differences of 2 opera performances every other day).
Notice that we make this calculation simply by writing diff(Date)
, which will find all 59 differences across the entire vector of 60 dates. This is an excellent example of one of R’s greatest strengths, its “vectorization,” which allows us to make calculations on long vectors very simply (I’ll say more about vectorization in the Conclusion). Since for our current calculation we are interested only in the dates and titles, not in the receipts, we use the select()
function from dplyr
to show just those two columns, plus our new Days
column.
Because this calculation will give us just 59 differences, R will complain if we try to add a column of length 59 to a dataframe with 60 rows, so we need to add the number 0 in the first row of the new Days
column. We can do this using R’s concatenation function c()
, and it makes sense to do so, because the first opera performance of the new season on 26 Apr 1791 didn’t have another opera preceding it that season, so the number of days since the previous opera performance is 0.
receipts %>%
select(Date:Title) %>%
mutate(Days = c(0, diff(Date)))
## Date Title Days
## 1 1791-04-26 La bella pescatrice 0
## 2 1791-04-28 La bella pescatrice 2
## 3 1791-04-30 La bella pescatrice 2
## 4 1791-05-02 Il talismano 2
## 5 1791-05-04 La bella pescatrice 2
## 6 1791-05-06 Il talismano 2
## 7 1791-05-08 La bella pescatrice 2
## 8 1791-05-10 La bella pescatrice 2
## 9 1791-05-12 La bella pescatrice 2
## 10 1791-05-14 La molinara 2
## 11 1791-05-16 La molinara 2
## 12 1791-05-18 La bella pescatrice 2
## 13 1791-05-20 La molinara 2
## 14 1791-05-22 Il talismano 2
## 15 1791-06-01 La molinara 10
## 16 1791-06-03 La bella pescatrice 2
## 17 1791-06-05 La molinara 2
## 18 1791-06-07 Il pazzo per forza 2
## 19 1791-06-17 Il pazzo per forza 10
## 20 1791-06-19 La bella pescatrice 2
## 21 1791-06-21 La bella pescatrice 2
## 22 1791-06-24 Il pazzo per forza 3
## 23 1791-07-05 Il pazzo per forza 11
## 24 1791-07-07 La molinara 2
## 25 1791-07-13 Le trame deluse 6
## 26 1791-07-15 Il pazzo per forza 2
## 27 1791-07-17 Il pazzo per forza 2
## 28 1791-07-19 La bella pescatrice 2
## 29 1791-07-23 Il pazzo per forza 4
## 30 1791-07-25 La molinara 2
## 31 1791-07-29 La bella pescatrice 4
## 32 1791-08-01 La bella pescatrice 3
## 33 1791-08-03 La bella pescatrice 2
## 34 1791-08-05 La pastorella nobile 2
## 35 1791-08-07 La pastorella nobile 2
## 36 1791-08-09 Il pazzo per forza 2
## 37 1791-08-11 La pastorella nobile 2
## 38 1791-08-13 Il pazzo per forza 2
## 39 1791-08-15 Le trame deluse 2
## 40 1791-08-18 La molinara 3
## 41 1791-08-20 La bella pescatrice 2
## 42 1791-08-22 La pastorella nobile 2
## 43 1791-08-24 Le trame deluse 2
## 44 1791-08-26 Il pazzo per forza 2
## 45 1791-08-28 La molinara 2
## 46 1791-08-30 La bella pescatrice 2
## 47 1791-09-01 Le trame deluse 2
## 48 1791-09-05 La molinara 4
## 49 1791-09-07 La bella pescatrice 2
## 50 1791-09-10 La molinara 3
## 51 1791-09-12 La bella pescatrice 2
## 52 1791-09-14 La pastorella nobile 2
## 53 1791-09-18 I zingari in fiera 4
## 54 1791-09-20 I zingari in fiera 2
## 55 1791-09-24 I zingari in fiera 4
## 56 1791-09-26 La bella pescatrice 2
## 57 1791-09-27 La molinara 1
## 58 1791-09-30 I zingari in fiera 3
## 59 1791-10-02 I zingari in fiera 2
## 60 1791-10-04 Il pazzo per forza 2
We can immediately see that operas most often were performed every other day, that is, with a difference in Date
equal to 2 days. In fact, we can easily use R to count the frequency of each interval. We first create a vector called Days
(mimicking the one that we just temporarily created in our result table) that records just the differences in days from one opera performance to the next:
Days <- diff(receipts$Date)
Days
## Time differences in days
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 10 2 2 2 10 2 2 3 11 2
## [24] 6 2 2 2 4 2 4 3 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2
## [47] 4 2 3 2 2 4 2 4 2 1 3 2 2
We then use base R’s table()
function to count the number of times each interval appears:
table(Days)
## Days
## 1 2 3 4 6 10 11
## 1 44 5 5 1 2 1
The first row of numbers shows the interval (1 day, 2 days, 3 days, and so on), and the second row shows the number of times that interval appears in our dataset. From this we can see that operas were performed every other day (that is, with an interval of 2 days) 44 times, and on the third or fourth day following the previous one 5 times each. Only once during the period were operas performed back to back, that is, with an interval of 1 day (on 26 and 27 Sep 1791, as we can see from the full table). There were four longer gaps: 10 days from the performance of Il talismano on 22 May to that of La molinara on 1 Jun; 10 days from the performance of Il pazzo per forza on 7 Jun to the next performance of the same opera on 17 Jun; 11 days from the performance of Il pazzo per forza on 24 Jun to the next of the same opera on 5 Jul; and 6 days from the performance of La molinara on 7 Jul to the performance of Le trame deluse on 13 Jul.
These unusual “gaps” suggest an obvious question: why did these gaps happen? And that is precisely the goal of this sort of exploratory data analysis: it brings to light patterns and relationships that may suggest questions that we might otherwise not have thought to ask.
We would especially like to know the explanation for the relatively long gaps of 10 and 11 days between opera performances in the summer of 1791; but to try to answer that question here would take us too far afield. And there are many other questions we could ask about the programming of operas: for example, were operas programmed more frequently on some days of the week than others? But the analysis here has given a good taste of what is possible, so we’ll now turn to the analysis of the receipts.
We start by producing a table of opera performances sorted by receipts in total kreuzer, from highest to lowest. This is easy to do with dplyr
functions that we’ve already met:
receipts %>% arrange(desc(Total_kreuzer))
## Date Title fl kr Total_kreuzer
## 1 1791-07-15 Il pazzo per forza 556 21 33381
## 2 1791-09-18 I zingari in fiera 536 16 32176
## 3 1791-04-26 La bella pescatrice 483 10 28990
## 4 1791-05-08 La bella pescatrice 479 34 28774
## 5 1791-05-14 La molinara 467 42 28062
## 6 1791-04-28 La bella pescatrice 418 42 25122
## 7 1791-05-04 La bella pescatrice 397 48 23868
## 8 1791-07-13 Le trame deluse 361 22 21682
## 9 1791-04-30 La bella pescatrice 325 11 19511
## 10 1791-05-16 La molinara 305 27 18327
## 11 1791-05-20 La molinara 269 37 16177
## 12 1791-05-18 La bella pescatrice 254 33 15273
## 13 1791-05-10 La bella pescatrice 251 58 15118
## 14 1791-09-27 La molinara 239 33 14373
## 15 1791-10-02 I zingari in fiera 238 32 14312
## 16 1791-09-05 La molinara 234 2 14042
## 17 1791-08-20 La bella pescatrice 232 53 13973
## 18 1791-06-21 La bella pescatrice 231 12 13872
## 19 1791-06-03 La bella pescatrice 227 23 13643
## 20 1791-08-05 La pastorella nobile 224 4 13444
## 21 1791-09-20 I zingari in fiera 222 36 13356
## 22 1791-09-26 La bella pescatrice 218 21 13101
## 23 1791-08-30 La bella pescatrice 215 35 12935
## 24 1791-07-23 Il pazzo per forza 208 40 12520
## 25 1791-08-22 La pastorella nobile 203 38 12218
## 26 1791-06-01 La molinara 199 13 11953
## 27 1791-06-07 Il pazzo per forza 187 10 11230
## 28 1791-07-07 La molinara 186 29 11189
## 29 1791-09-24 I zingari in fiera 180 54 10854
## 30 1791-08-18 La molinara 175 39 10539
## 31 1791-05-12 La bella pescatrice 174 49 10489
## 32 1791-09-07 La bella pescatrice 172 36 10356
## 33 1791-07-19 La bella pescatrice 171 57 10317
## 34 1791-07-25 La molinara 166 36 9996
## 35 1791-09-30 I zingari in fiera 162 18 9738
## 36 1791-08-28 La molinara 161 5 9665
## 37 1791-07-29 La bella pescatrice 160 42 9642
## 38 1791-09-10 La molinara 158 41 9521
## 39 1791-08-03 La bella pescatrice 156 55 9415
## 40 1791-06-05 La molinara 154 38 9278
## 41 1791-06-19 La bella pescatrice 151 22 9082
## 42 1791-06-17 Il pazzo per forza 149 42 8982
## 43 1791-09-14 La pastorella nobile 137 27 8247
## 44 1791-08-07 La pastorella nobile 127 7 7627
## 45 1791-10-04 Il pazzo per forza 118 30 7110
## 46 1791-09-01 Le trame deluse 117 20 7040
## 47 1791-08-24 Le trame deluse 111 21 6681
## 48 1791-08-11 La pastorella nobile 110 20 6620
## 49 1791-08-26 Il pazzo per forza 103 8 6188
## 50 1791-09-12 La bella pescatrice 99 8 5948
## 51 1791-08-15 Le trame deluse 94 35 5675
## 52 1791-05-02 Il talismano 84 41 5081
## 53 1791-08-01 La bella pescatrice 83 12 4992
## 54 1791-05-06 Il talismano 81 6 4866
## 55 1791-06-24 Il pazzo per forza 75 25 4525
## 56 1791-07-05 Il pazzo per forza 63 16 3796
## 57 1791-08-09 Il pazzo per forza 54 48 3288
## 58 1791-08-13 Il pazzo per forza 35 2 2102
## 59 1791-07-17 Il pazzo per forza 29 34 1774
## 60 1791-05-22 Il talismano 26 45 1605
We can get the same result in base R with the perhaps slightly less intuitive:
receipts[order(receipts$Total_kreuzer, decreasing=TRUE),]
Simple inspection of the table shows that two operas dominate the bottom tier: Salieri’s Il talismano and Joseph Weigl’s Il pazzo per forza, which occupy eight of the bottom nine places. The highest receipts for Il talismano are 84 fl 41 kr, in position 52. The paltry 26 fl 45 kr taken at the box office from the performance of Il talismano on 22 May 1791 suggest that the theater was nearly empty.8 It is not surprising, then, that the opera was dropped permanently from the repertory of the Burgtheater following this performance. Le trame deluse was also doing poorly: apart from its receipts of 361 fl 22 kr on 13 Jul 1791, its first performance in the Burgtheater since 28 Dec 1787, at number 8 in the ranking, its other three performances come in at numbers 46, 47, and 51. So its revival was clearly not a success with the Viennese public in 1791.
Extracting the rows for a particular opera is simple. In classic R, we write:
receipts[receipts$Title == "Il pazzo per forza",]
## Date Title fl kr Total_kreuzer
## 18 1791-06-07 Il pazzo per forza 187 10 11230
## 19 1791-06-17 Il pazzo per forza 149 42 8982
## 22 1791-06-24 Il pazzo per forza 75 25 4525
## 23 1791-07-05 Il pazzo per forza 63 16 3796
## 26 1791-07-15 Il pazzo per forza 556 21 33381
## 27 1791-07-17 Il pazzo per forza 29 34 1774
## 29 1791-07-23 Il pazzo per forza 208 40 12520
## 36 1791-08-09 Il pazzo per forza 54 48 3288
## 38 1791-08-13 Il pazzo per forza 35 2 2102
## 44 1791-08-26 Il pazzo per forza 103 8 6188
## 60 1791-10-04 Il pazzo per forza 118 30 7110
This means: from the receipts
dataframe select those rows in which Title
is equal to “Il pazzo per forza”. Not the most transparent syntax, and one that can easily cause headaches for beginners. (The symbol ==
means “is equivalent to,” and is used in True/False comparisons in R. It’s an important symbol to remember, as it is used frequently.)
The syntax using dplyr
seems more straightforward and intuitive:
receipts %>% filter(Title == "Il pazzo per forza")
## Date Title fl kr Total_kreuzer
## 1 1791-06-07 Il pazzo per forza 187 10 11230
## 2 1791-06-17 Il pazzo per forza 149 42 8982
## 3 1791-06-24 Il pazzo per forza 75 25 4525
## 4 1791-07-05 Il pazzo per forza 63 16 3796
## 5 1791-07-15 Il pazzo per forza 556 21 33381
## 6 1791-07-17 Il pazzo per forza 29 34 1774
## 7 1791-07-23 Il pazzo per forza 208 40 12520
## 8 1791-08-09 Il pazzo per forza 54 48 3288
## 9 1791-08-13 Il pazzo per forza 35 2 2102
## 10 1791-08-26 Il pazzo per forza 103 8 6188
## 11 1791-10-04 Il pazzo per forza 118 30 7110
In other words: filter
the receipts
dataframe, returning those rows where Title
is equal to “Il pazzo per forza.” To sort this table in order of descending receipts, we again use arrange()
:
receipts %>%
filter(Title == "Il pazzo per forza") %>%
arrange(desc(Total_kreuzer))
## Date Title fl kr Total_kreuzer
## 1 1791-07-15 Il pazzo per forza 556 21 33381
## 2 1791-07-23 Il pazzo per forza 208 40 12520
## 3 1791-06-07 Il pazzo per forza 187 10 11230
## 4 1791-06-17 Il pazzo per forza 149 42 8982
## 5 1791-10-04 Il pazzo per forza 118 30 7110
## 6 1791-08-26 Il pazzo per forza 103 8 6188
## 7 1791-06-24 Il pazzo per forza 75 25 4525
## 8 1791-07-05 Il pazzo per forza 63 16 3796
## 9 1791-08-09 Il pazzo per forza 54 48 3288
## 10 1791-08-13 Il pazzo per forza 35 2 2102
## 11 1791-07-17 Il pazzo per forza 29 34 1774
This seems to show that the receipts from Il pazzo per forza were suprisingly volatile, ranging from a high of 556 fl 21 on 15 Jul 1791 to a pitiful low of 29 fl 34 kr just two days later. Why this volatility?
It turns out that the high receipts on 15 Jul came under special circumstances: only the first act of Il pazzo per forza was performed, and it was coupled with the Viennese premiere of the “scena drammatica” Pimmalione, an Italian adaptation by Antonio Simone Sografi of Jean-Jacques Rousseau’s Pygmalion (1770), with music by Giovanni Batiste Cimador. So receipts shot up on 15 Jul when people flocked to see the premiere. Pimmalione was again paired with a partial performance of Il pazzo per forza on 23 Jul, with decent receipts of 208 fl 40 kr. But in general, Il pazzo per forza was doing poorly at the box office, and it fell permanently from the repertory after the performance on 4 Oct 1791, the last date in our dataset.
We have already seen from our earlier use of the summary()
function that the average box-office receipts for all operas performed in the Burgtheater over these 23 weeks was 12227.68 kr per performance. To convert kreuzer back to gulden (fl) and kreuzer (kr), we need to use two simple formulas. In the first, we use what is called “integer division” to find the number of whole gulden in 12227.68 kr, given that there were 60 kreuzer in a gulden. Integer division tells us how many times one integer will divide into a given number, ignoring the remainder. The number 60 goes into 12227.68 203 times, so 203 is the number of gulden in 12227.68 kr. The remainder from that division is equal to the number of kreuzer less than one gulden. We capture this remainder using the “modulo” operator, often abbreviated mod
. So 12227.68 mod 60
is equal to 48 (rounded to the nearest kreuzer), the remainder when we divide 12227.68 by 60. Thus the average of 12227.68 kr per performance is equal to 203 fl 48 kr, rounded to the nearest kreuzer.
The symbols for these two operations in R are not among its more mnemonic or elegant: the operator for integer division is %/%
and the operator for modular division is %%
(by way of comparison, the symbols for these operations in the Python programming language are the simpler //
and %
respectively).
If we have to make a lot of these conversions, it is convenient and fairly simple to write a little function in R to do it:
gulden <- function(x){cat(x %/% 60, "fl", round(x %% 60), "kr\n")}
This is less intimidating than it may look. It states simply that when we call our function gulden()
with the number x
(say, for example, 12227.68), it will calculate:
x %/% 60
(integer division, for the number of gulden), andx %% 60
(modulo division, giving the number of remaining kreuzer), rounding up to the nearest kreuzer.We then paste the whole thing together using the function cat()
(for concatenating strings) with the abbreviations “fl” and “kr” to give us our formatted answer, 203 fl 48 kr
.
Here it is in action:
gulden(12227.68)
## 203 fl 48 kr
As I show in my commentary to the anonymous correspondent’s report to the Bayreuther Zeitung, an average of 203 fl 48 kr is in line with the long-term average for operas in the Burgtheater over the preceding two seasons. The average receipts for all operas performed in the Burgtheater in the seasons 1789-90 and 1790-91 was 196 fl 10 kr, and the average for operas from 26 Apr to 4 Oct 1791 is actually a bit higher than this. Can we find any evidence that the average receipts from operas declined during the weeks closer to the anonymous correspondent’s report, evidence that may suggest audiences were in fact staying away from the Burghteater?
To look at a particular chronological slice of our dataset, we again use the filter()
function from dplyr
, along with the operators >=
meaning “greater than or equal to” and <=
, “less than or equal to.” To select the rows from 1 Aug to 4 Oct, we write:
receipts %>%
filter(Date >= as.Date("1791-08-01"))
## Date Title fl kr Total_kreuzer
## 1 1791-08-01 La bella pescatrice 83 12 4992
## 2 1791-08-03 La bella pescatrice 156 55 9415
## 3 1791-08-05 La pastorella nobile 224 4 13444
## 4 1791-08-07 La pastorella nobile 127 7 7627
## 5 1791-08-09 Il pazzo per forza 54 48 3288
## 6 1791-08-11 La pastorella nobile 110 20 6620
## 7 1791-08-13 Il pazzo per forza 35 2 2102
## 8 1791-08-15 Le trame deluse 94 35 5675
## 9 1791-08-18 La molinara 175 39 10539
## 10 1791-08-20 La bella pescatrice 232 53 13973
## 11 1791-08-22 La pastorella nobile 203 38 12218
## 12 1791-08-24 Le trame deluse 111 21 6681
## 13 1791-08-26 Il pazzo per forza 103 8 6188
## 14 1791-08-28 La molinara 161 5 9665
## 15 1791-08-30 La bella pescatrice 215 35 12935
## 16 1791-09-01 Le trame deluse 117 20 7040
## 17 1791-09-05 La molinara 234 2 14042
## 18 1791-09-07 La bella pescatrice 172 36 10356
## 19 1791-09-10 La molinara 158 41 9521
## 20 1791-09-12 La bella pescatrice 99 8 5948
## 21 1791-09-14 La pastorella nobile 137 27 8247
## 22 1791-09-18 I zingari in fiera 536 16 32176
## 23 1791-09-20 I zingari in fiera 222 36 13356
## 24 1791-09-24 I zingari in fiera 180 54 10854
## 25 1791-09-26 La bella pescatrice 218 21 13101
## 26 1791-09-27 La molinara 239 33 14373
## 27 1791-09-30 I zingari in fiera 162 18 9738
## 28 1791-10-02 I zingari in fiera 238 32 14312
## 29 1791-10-04 Il pazzo per forza 118 30 7110
In fact, we can get away with omitting as.Date()
, making the expression a bit easier to type and to read:
receipts %>%
filter(Date >= "1791-08-01")
## Date Title fl kr Total_kreuzer
## 1 1791-08-01 La bella pescatrice 83 12 4992
## 2 1791-08-03 La bella pescatrice 156 55 9415
## 3 1791-08-05 La pastorella nobile 224 4 13444
## 4 1791-08-07 La pastorella nobile 127 7 7627
## 5 1791-08-09 Il pazzo per forza 54 48 3288
## 6 1791-08-11 La pastorella nobile 110 20 6620
## 7 1791-08-13 Il pazzo per forza 35 2 2102
## 8 1791-08-15 Le trame deluse 94 35 5675
## 9 1791-08-18 La molinara 175 39 10539
## 10 1791-08-20 La bella pescatrice 232 53 13973
## 11 1791-08-22 La pastorella nobile 203 38 12218
## 12 1791-08-24 Le trame deluse 111 21 6681
## 13 1791-08-26 Il pazzo per forza 103 8 6188
## 14 1791-08-28 La molinara 161 5 9665
## 15 1791-08-30 La bella pescatrice 215 35 12935
## 16 1791-09-01 Le trame deluse 117 20 7040
## 17 1791-09-05 La molinara 234 2 14042
## 18 1791-09-07 La bella pescatrice 172 36 10356
## 19 1791-09-10 La molinara 158 41 9521
## 20 1791-09-12 La bella pescatrice 99 8 5948
## 21 1791-09-14 La pastorella nobile 137 27 8247
## 22 1791-09-18 I zingari in fiera 536 16 32176
## 23 1791-09-20 I zingari in fiera 222 36 13356
## 24 1791-09-24 I zingari in fiera 180 54 10854
## 25 1791-09-26 La bella pescatrice 218 21 13101
## 26 1791-09-27 La molinara 239 33 14373
## 27 1791-09-30 I zingari in fiera 162 18 9738
## 28 1791-10-02 I zingari in fiera 238 32 14312
## 29 1791-10-04 Il pazzo per forza 118 30 7110
How can we find the mean of the performances for this shorter period? Given the “chained” expressions we’ve seen in dplyr
using the %>%
operator, it might seem natural to think (as I did at first) that we could write something like this:
receipts %>%
filter(Date >= "1791-08-01") %>%
mean(Total_kreuzer)
## Warning in mean.default(., Total_kreuzer): argument is not numeric or
## logical: returning NA
## [1] NA
But this obviously doesn’t work. The proper syntax9 is:
receipts %>%
filter(Date >= "1791-08-01") %>%
summarize(Mean = mean(Total_kreuzer))
## Mean
## 1 10190.9
The “1” at the left before “10190.9” is a row number, since dplyr
returns its result as a tiny dataframe, with one row and one column.
We can easily extend this expression to incude a conversion to gulden and kreuzer using the formulas we just developed (I’ve also added round()
to round the mean to the nearest kreuzer, which is close enough for musicology):
receipts %>%
filter(Date >= "1791-08-01") %>%
summarize(Mean = round(mean(Total_kreuzer)),
fl = Mean %/% 60,
kr = Mean %% 60)
## Mean fl kr
## 1 10191 169 51
(Notice that once you’ve calculated the mean and assigned it to the label Mean
, you can refer to that label in subsequent calculations.)
So yes, the average receipts over the months of Aug and Sep 1791 were considerably lower than the long-term average for the entire period, 203 fl 48 kr. What about the average receipts for Aug alone? (In the following expression, &
is R’s symbol for logical “and.” Notice that because we have now worked out the basic syntax for this sort of query, we can simply copy the parts that are unchanged into our subsequent ones.)
receipts %>%
filter(Date >= "1791-08-01" & Date <= "1791-08-31") %>%
summarize(Mean = round(mean(Total_kreuzer)),
fl = Mean %/% 60,
kr = Mean %% 60)
## Mean fl kr
## 1 8357 139 17
The average receipts from operas were very low in Aug.
We can also achieve the same result with a little less typing using the function month()
from lubridate
:
receipts %>%
filter(month(Date) == 8) %>%
summarize(Mean = round(mean(Total_kreuzer)),
fl = Mean %/% 60,
kr = Mean %% 60)
## Mean fl kr
## 1 8357 139 17
However, we know from the tables we’ve seen so far that two of the operas performed in Aug, Il pazzo per forza (given three times that month) and Le trame deluse (given twice) were doing poorly at the box-office and they probably drove the average down. What were the average receipts in Aug if we remove these two operas from the calculation? (In the following, !=
is R’s symbol for “not equal.” So the argument to the filter()
function reads: “filter the receipts
dataset, retaining only those performances in Aug where Title
is not equal to”Il pazzo per forza" and Title
is not equal to “Le trame deluse”.)
receipts %>%
filter(month(Date) == 8 &
Title != "Il pazzo per forza" &
Title != "Le trame deluse") %>%
summarize(Mean = round(mean(Total_kreuzer)),
fl = Mean %/% 60,
kr = Mean %% 60)
## Mean fl kr
## 1 10143 169 3
So the average receipts from the other three operas performed in Aug were not quite so bad, but still below the long-term average for the season.
We can use the same technique to find the average receipts for a particular title:
receipts %>%
filter(Title == "La bella pescatrice") %>%
summarize(Mean = round(mean(Total_kreuzer)),
fl = Mean %/% 60,
kr = Mean %% 60)
## Mean fl kr
## 1 14721 245 21
La bella pescatrice was doing quite well overall, with average receipts of 245 fl 21 kr in 20 performances across the 23-week period, well above the average for all operas. It even did relatively well, if not spectacularly, during the slump in Aug:
receipts %>%
filter(Title == "La bella pescatrice" & month(Date) == 8) %>%
summarize(Mean = round(mean(Total_kreuzer)),
fl = Mean %/% 60,
kr = Mean %% 60)
## Mean fl kr
## 1 10329 172 9
We don’t have to use dplyr
for these calculations. There are a number of other ways in base R to apply a function like mean()
to a subset of data. One way is to write:
mean(receipts$Total_kreuzer[receipts$Title == "La bella pescatrice"])
## [1] 14721.05
That works, but the syntax seems unnecessarily opaque, because of the embedding and the mixing of parentheses and square brackets. Using subset()
from base R is somewhat easier to read:
mean(subset(receipts, Title == "La bella pescatrice")$Total_kreuzer)
## [1] 14721.05
This expression reads: take the subset()
of the dataframe receipts
for which the Title
is “La bella pescatrice,” then take the mean()
of the column Total_kreuzer
in that subset. (As you can see, the embedded functions are read from the inside out.)
We can use our previously defined function gulden()
to convert this to gulden and kreuzer:
gulden(mean(subset(receipts, Title=="La bella pescatrice")$Total_kreuzer))
## 245 fl 21 kr
This is a good demonstration of what I said earlier about “multiply embedded” expressions in R. Once you know the syntax, this is easy to type, but it is rather less easy for readers to parse visually, with its three sets of embedded parentheses.
It is simple to use dplyr
functions to produce a table of the average receipts for all seven titles:
receipts %>%
select(Date, Title, Total_kreuzer) %>%
group_by(Title) %>%
summarize(Mean = round(mean(Total_kreuzer)),
fl = Mean %/% 60,
kr = Mean %% 60)
## Source: local data frame [7 x 4]
##
## Title Mean fl kr
## 1 I zingari in fiera 16087 268 7
## 2 Il pazzo per forza 8627 143 47
## 3 Il talismano 3851 64 11
## 4 La bella pescatrice 14721 245 21
## 5 La molinara 13594 226 34
## 6 La pastorella nobile 9631 160 31
## 7 Le trame deluse 10270 171 10
To sort this little table by descending order of average receipts, we simply append %>% arrange(desc(Mean))
to the end of the previous expression:
receipts %>%
select(Date, Title, Total_kreuzer) %>%
group_by(Title) %>%
summarize(Mean = round(mean(Total_kreuzer)),
fl = Mean %/% 60,
kr = Mean %% 60) %>%
arrange(desc(Mean))
## Source: local data frame [7 x 4]
##
## Title Mean fl kr
## 1 I zingari in fiera 16087 268 7
## 2 La bella pescatrice 14721 245 21
## 3 La molinara 13594 226 34
## 4 Le trame deluse 10270 171 10
## 5 La pastorella nobile 9631 160 31
## 6 Il pazzo per forza 8627 143 47
## 7 Il talismano 3851 64 11
Not surprisingly, Paisiello’s I zingari in fiera, which had its Viennese premiere on 18 Sep 1791, late in our period of interest, has the highest average receipts; it was the first new opera performed in the Burgtheater since 26 Apr.
difftime("1791-09-18", "1791-04-26", unit = "days")
## Time difference of 145 days
A long time for a notoriously novelty hungry Viennese public to have to wait for a new opera.
There are many other questions we could pose to this little dataset, but let’s end this section by calculating the average opera receipts by month, using the functions we’re already familiar with from dply
and lubridate
:
receipts %>%
group_by(Month = month(Date, label=TRUE, abbr=TRUE)) %>%
summarize(Mean = round(mean(Total_kreuzer)),
fl = Mean %/% 60,
kr = Mean %% 60)
## Source: local data frame [7 x 4]
##
## Month Mean fl kr
## 1 Apr 24541 409 1
## 2 May 15240 254 0
## 3 Jun 10321 172 1
## 4 Jul 12700 211 40
## 5 Aug 8357 139 17
## 6 Sep 12396 206 36
## 7 Oct 10711 178 31
We can ignore Apr and Oct, for which our dataset contains only a few days (the average for Apr is unusually high because it reflects the first three performances of La bella pescatrice, a new opera for the Viennese public, and one that remained in the repertory until 1799). Looking at the rest of the months, there is no obviously consistent downward trend; in fact, the average receipts for opera performances in Sep recover considerably from a dip in Aug.
A wide variety of interesting and potentially informative graphs can be produced using this dataset, but I’ll end this analysis with five very simple ones. First we plot the receipts from the operas over time, with dots representing the receipts on any particular date. We’ll use the qplot()
(quick plot) function from the add-on package ggplot2
; ggplot2
is very widely used today in the world of R graphing and is very powerful and flexible, something our five plots will only hint at.
qplot(Date, Total_kreuzer, data=receipts,
geom="point",
main="Opera Receipts, Burgtheater, 26 Apr to 4 Oct 1791",
xlab="1791",
ylab="Total kreuzer")
The function call to qplot()
says that we want to graph Date
on the x axis and Total_kreuzer
on the y axis, using data from the receipts
dataframe. We then set several attirbutes. The kind of plot (the geom
in the lingo of ggplot2
) is point
(that is, we’re drawing dots), the main
title is as shown, the label for the x axis is given by xlab
, and that for the y axis by ylab
.
This plot makes clear that the receipts were quite variable. We can plot lines between the points to see if that might help us discern a trend. The only thing we have to change in our call to qplot()
is to give geom="line"
in place of geom="point"
.
qplot(Date, Total_kreuzer, data=receipts,
geom="line",
main="Opera Receipts, Burgtheater, 26 Apr to 4 Oct 1791",
xlab="1791",
ylab="Total kreuzer")
We see a large spike in Jul, corresponding to the premiere of Pimmalione, and another in Sep, corresponding to the premiere of I zingari in fiera, but otherwise it’s difficult to discern much of an overall trend. So perhaps the anonymous correspondent to the Bayreuther Zeitung was exaggerating in saying that the Viennese public had “forgotten” about the Burgtheater because of the popular new works in the suburban theaters, including the incipient smash hit Die Zauberflöte.
There is another plot, however, that does seem to show an overall downward trend over these 23 weeks, a plot of the “cumulative mean.” The cumulative mean is just what is sounds like: the mean of all performances from the starting date up to and including each date in succession: thus the “mean” of the first performance (equal simply to the receipts from that performance), then the mean of the first two, then the first three, the first four, and so on. Here is the plot:
qplot(Date, cummean(Total_kreuzer), data=receipts,
geom="line",
main="Cumulative Mean, Opera Receipts, Burgtheater",
xlab = "1791",
ylab="Cumulative Mean")
This plot shows a very clear downward trend in the cumulative average of receipts from operas in the Burgtheater between the premiere of La bella pescatrice on 26 Apr 1791 to the not very well attended performance of Il pazzo per forza on 4 Oct. To put it simply: the overall average of the receipts continued to sink as the number of performances increased. So perhaps the anonymous correspondent was onto something after all.
Finally two plots to give a taste of the attractive results possible using ggplot2
. First a quick plot using geom="smooth"
to show a smoothed trend line for the receipts, with a gray band to show a margin of error on either side.
qplot(Date, Total_kreuzer, data=receipts,
geom="smooth",
main="Opera Receipts, Burghteater, 26 Apr to 4 Oct 1791",
xlab="1791",
ylab="Total kreuzer")
This plot shows a mild recovery from mid Aug into early Oct, where our dataset ends.
And finally a histogram using the more powerful function ggplot()
. The vertical rectangles in a histogram are called “bins,” and the height of each bin in this case is proportional to the number of performances that fall into it. ggplot()
also allows us to vary the color of the bin by the number of performances in each bin.
ggplot(data=receipts, aes(Total_kreuzer)) +
geom_histogram(col="grey",
aes(fill=..count..)) +
labs(title="Opera Receipts, Burgtheater, 26 Apr to 4 Oct 1791") +
labs(x = "Total kreuzer", y = "Count")
The histogram shows a peak around 10000 kr, which tells us that this was the most frequent level for the box-office receipts over this period. Much more could be said about the distribution of the receipts, but I’ll leave that for another time.
This analysis has only scratched the surface of what can be done with R, even with a simple dataset like the one used here. Because the R language runs interactively, once you are familiar with R’s functions and syntax, you can begin to “ask questions” of your data as you think of them, simply by typing a expression at the console prompt to see what happens. That process of interactive exploratory data analysis can be extraordinarily productive and help stimulate new ways of thinking about your data and new questions to ask it.
R, largely justifiably, has a reputation for having a long learning curve. Its syntax can sometimes be opaque and it is the sort of language where the details of the syntax may fade from memory if you don’t use it every day. Its documentation is not aimed at beginners, and can be frustrating to deal with. Even with dplyr
, expressions can easily expand to the point of seeming intimidating. Given this complexity, why has it become so wildly popular, and why, in particular, should a mathematically-challenged scholar in the humanities think of downloading, learning, and using it?
There are many compelling reasons, of which I’ll touch only on a few that contributed to making me a convert and advocate:
R is free, extremely easy to download and install, and it runs on every major operating system. It even runs on iOS.
It is extremely powerful, and its capabilities are being extended all the time by contributed “packages” (such as dplyr
, lubridate
, and ggplot2
) that are simple to install. As of this writing, there are 6408 packages in the CRAN repository (CRAN stands for “Comprehensive R Archive Network”). These packages cover just about every possible specialized use you can think of, and this number does not include the 934 packages available for R through Bioconductor, where the focus is on working with genomic data.
Because of its power and because it is free, R has become extremely popular. By some rankings it is coming close to cracking the top 10 programming languages overall (for example, it is currently ranked 13th here). There are many excellent reference and tutorial sites for R online, and literally hundreds of books, as a quick search on Amazon will show. R is very widely used in academia, across a very wide range of fields.
R is becoming a key tool in the toolbox for scholars in the digital humanities. To take just one example, see the website and work of Matthew Jockers, currently Associate Professor of English at the University of Nebraska, Lincoln, and one of the co-founders, with Franco Moretti, of the Stanford Literary Lab. One of Jockers’ books is Text Analysis with R for Students of Literature.
R is legendary for its powerful and sophisticated graphics capabilities, of which I’ve given just a small taste in the graphs produced here.
R has inspired the wonderful IDE (integrated development environement) RStudio, which is likewise simple to download and install, and a pleasure to use. And the RStudio team continues to create cool and useful new tools for R, such as R Markdown, which I used to create this document. In an R Markdown document, the chunks of R code are evaulated when the document is compiled and inserted into the document automatically. In other words, I didn’t have to type the results into the document myself, or make jpegs of the graphs and paste them in: everything was calculated “live” from the embedded chunks of code and inserted automatically. Extremely cool, very easy to learn and use, and essential for creating reproducible research. This document is the first I’ve produced using R Markdown, actually the first document of any length I’ve produced using Markdown of any flavor, and I got the hang of it in less than an hour.
In spite of its learning curve, R definitely has a “cool” factor. Its “vectorization” allows computations that would be quite complicated to express in many other leading programming languages to be done very simply. An example of the kind that hooked me:
To create a vector x
of numbers from 1 to 100, you write simply:
x <- 1:100
x
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## [18] 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## [35] 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
## [52] 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
## [69] 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85
## [86] 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
(The numbers on the left in square bracets show the “index” in the vector of the item just to the right of the number in brackets. So for example, if we see the index [18]
at the left end of a row, that means that the item just to its right is the eighteenth item in the vector. R does not automatically print the result of an assignment statement like x <- 1:100
, so we type x
again at the prompt to see what x
contains.)
And now that we have that vector, we can perform operations on it just as if it were a single number. For example, we can square every number in the vector simply by writing:
x ^ 2
## [1] 1 4 9 16 25 36 49 64 81 100 121
## [12] 144 169 196 225 256 289 324 361 400 441 484
## [23] 529 576 625 676 729 784 841 900 961 1024 1089
## [34] 1156 1225 1296 1369 1444 1521 1600 1681 1764 1849 1936
## [45] 2025 2116 2209 2304 2401 2500 2601 2704 2809 2916 3025
## [56] 3136 3249 3364 3481 3600 3721 3844 3969 4096 4225 4356
## [67] 4489 4624 4761 4900 5041 5184 5329 5476 5625 5776 5929
## [78] 6084 6241 6400 6561 6724 6889 7056 7225 7396 7569 7744
## [89] 7921 8100 8281 8464 8649 8836 9025 9216 9409 9604 9801
## [100] 10000
If we have another vector of the same length, we can add every corresponding pair if numbers in the two vectors simply by writing x + y
. As a demonstration, let’s generate a vector y
of the numbers from 1 to 100 in random order (also very easy to do), and then add it to the vector x
:
y <- sample(1:100, 100)
y
## [1] 32 3 83 11 71 81 41 70 13 24 28 78 73 47 1 100 30
## [18] 68 60 35 99 75 53 54 63 58 22 77 10 96 39 43 5 23
## [35] 29 34 46 80 93 82 57 89 37 7 55 84 52 91 87 27 94
## [52] 42 18 72 50 20 62 6 44 56 2 26 45 38 8 92 74 36
## [69] 25 90 66 21 33 4 95 17 97 88 48 40 16 51 86 59 49
## [86] 76 19 15 61 12 14 64 79 69 85 65 9 98 31 67
x + y
## [1] 33 5 86 15 76 87 48 78 22 34 39 90 86 61 16 116 47
## [18] 86 79 55 120 97 76 78 88 84 49 105 39 126 70 75 38 57
## [35] 64 70 83 118 132 122 98 131 80 51 100 130 99 139 136 77 145
## [52] 94 71 126 105 76 119 64 103 116 63 88 108 102 73 158 141 104
## [69] 94 160 137 93 106 78 170 93 174 166 127 120 97 133 169 143 134
## [86] 162 106 103 150 102 105 156 172 163 180 161 106 196 130 167
Calculate the product of each corresponding pair of numbers in the vectors x
and y
:
x * y
## [1] 32 6 249 44 355 486 287 560 117 240 308 936 949 658
## [15] 15 1600 510 1224 1140 700 2079 1650 1219 1296 1575 1508 594 2156
## [29] 290 2880 1209 1376 165 782 1015 1224 1702 3040 3627 3280 2337 3738
## [43] 1591 308 2475 3864 2444 4368 4263 1350 4794 2184 954 3888 2750 1120
## [57] 3534 348 2596 3360 122 1612 2835 2432 520 6072 4958 2448 1725 6300
## [71] 4686 1512 2409 296 7125 1292 7469 6864 3792 3200 1296 4182 7138 4956
## [85] 4165 6536 1653 1320 5429 1080 1274 5888 7347 6486 8075 6240 873 9604
## [99] 3069 6700
And calculate the sum of those products:
sum(x * y)
## [1] 256329
This ease of writing what are actually quite extensive calculations is where R really shines.
And the ability to generate long sequences in this simple way isn’t limited to numbers. For example, to generate a list of dates from 1791-04-26 to 1791-10-04, we write:
seq(as.Date("1791-04-26"), as.Date("1791-10-04"), by="day")
## [1] "1791-04-26" "1791-04-27" "1791-04-28" "1791-04-29" "1791-04-30"
## [6] "1791-05-01" "1791-05-02" "1791-05-03" "1791-05-04" "1791-05-05"
## [11] "1791-05-06" "1791-05-07" "1791-05-08" "1791-05-09" "1791-05-10"
## [16] "1791-05-11" "1791-05-12" "1791-05-13" "1791-05-14" "1791-05-15"
## [21] "1791-05-16" "1791-05-17" "1791-05-18" "1791-05-19" "1791-05-20"
## [26] "1791-05-21" "1791-05-22" "1791-05-23" "1791-05-24" "1791-05-25"
## [31] "1791-05-26" "1791-05-27" "1791-05-28" "1791-05-29" "1791-05-30"
## [36] "1791-05-31" "1791-06-01" "1791-06-02" "1791-06-03" "1791-06-04"
## [41] "1791-06-05" "1791-06-06" "1791-06-07" "1791-06-08" "1791-06-09"
## [46] "1791-06-10" "1791-06-11" "1791-06-12" "1791-06-13" "1791-06-14"
## [51] "1791-06-15" "1791-06-16" "1791-06-17" "1791-06-18" "1791-06-19"
## [56] "1791-06-20" "1791-06-21" "1791-06-22" "1791-06-23" "1791-06-24"
## [61] "1791-06-25" "1791-06-26" "1791-06-27" "1791-06-28" "1791-06-29"
## [66] "1791-06-30" "1791-07-01" "1791-07-02" "1791-07-03" "1791-07-04"
## [71] "1791-07-05" "1791-07-06" "1791-07-07" "1791-07-08" "1791-07-09"
## [76] "1791-07-10" "1791-07-11" "1791-07-12" "1791-07-13" "1791-07-14"
## [81] "1791-07-15" "1791-07-16" "1791-07-17" "1791-07-18" "1791-07-19"
## [86] "1791-07-20" "1791-07-21" "1791-07-22" "1791-07-23" "1791-07-24"
## [91] "1791-07-25" "1791-07-26" "1791-07-27" "1791-07-28" "1791-07-29"
## [96] "1791-07-30" "1791-07-31" "1791-08-01" "1791-08-02" "1791-08-03"
## [101] "1791-08-04" "1791-08-05" "1791-08-06" "1791-08-07" "1791-08-08"
## [106] "1791-08-09" "1791-08-10" "1791-08-11" "1791-08-12" "1791-08-13"
## [111] "1791-08-14" "1791-08-15" "1791-08-16" "1791-08-17" "1791-08-18"
## [116] "1791-08-19" "1791-08-20" "1791-08-21" "1791-08-22" "1791-08-23"
## [121] "1791-08-24" "1791-08-25" "1791-08-26" "1791-08-27" "1791-08-28"
## [126] "1791-08-29" "1791-08-30" "1791-08-31" "1791-09-01" "1791-09-02"
## [131] "1791-09-03" "1791-09-04" "1791-09-05" "1791-09-06" "1791-09-07"
## [136] "1791-09-08" "1791-09-09" "1791-09-10" "1791-09-11" "1791-09-12"
## [141] "1791-09-13" "1791-09-14" "1791-09-15" "1791-09-16" "1791-09-17"
## [146] "1791-09-18" "1791-09-19" "1791-09-20" "1791-09-21" "1791-09-22"
## [151] "1791-09-23" "1791-09-24" "1791-09-25" "1791-09-26" "1791-09-27"
## [156] "1791-09-28" "1791-09-29" "1791-09-30" "1791-10-01" "1791-10-02"
## [161] "1791-10-03" "1791-10-04"
(In this case, we have to use as.Date
or else R won’t know what we want, and will return an error.) This is, in fact, exactly the expression I used as I was writing this document in order to find the true chronological midpoint of the 23-week period we’ve been analyzing. All I had to do was to wrap R’s median
function around the expression that generates the sequence (I didn’t even have to save the sequence in order to do this):
median(seq(as.Date("1791-04-26"), as.Date("1791-10-04"), by="day"))
## [1] "1791-07-15"
Much easier than looking at a calendar and counting by hand.
This document was created as part of my own learning process for R Markdown and the dplyr
package. Its immediate motivation was to show my analysis of the box-office receipts in the old Burgtheater, as an appendix to a detailed commentary on the report that I’ve referred to above from the Bayreuther Zeitung, an important new document for Mozart scholarship.
But I’ve hoped also to take this opportunity to try to show scholars in music history, theater history, and the humanities more generally how R can be used to open up new questions for research and to help answer those questions. And it doesn’t take any sophistication in statistics or math to make R useful. The most complicated calculations we’ve use here were integer and modulo division. The only statistical concepts we used were simple counting and averages. It is very easy to extend the techniques here to calculate variances, standard deviations, and the like, but I’ll leave that for another time.
I welcome any feedback or questions, and please let me know if you find any errors in this document. You can write to me at dexedge@gmail.com.
A pdf of this document can be downloaded here.
For a transcription of the correspondent’s report in the Bayreuther Zeitung, along with a detailed commentary, see this page at the site Mozart: New Documents, a collaborative project of Dexter Edge and David Black focusing especially on previously unknown Mozart documents that have come to light in new digitized repositories such as Google Books. The portion of the commentary corresponding to the analysis in this document is under the heading “The Viennese Theatrical Scene.”↩
The original ledger book in which the box-office receipts are recorded is found in the library of the Österreichischen Theatermuseums in Vienna, A-Wkm (Kunsthistorisches Museum), M 4000.↩
In this document I’ll be using a few technical terms, some of which have particular precise meanings in R; it will be helpful to have a basic understanding of these terms in order to follow the analysis here. In R, a vector
is an ordered collection of objects of a single class
. Classes include such basic kinds as integer
(the numbers for gulden and kreuzer, for example) or character
(‘a’, ‘B’, ‘#’, ‘ß’), as well as more complex structured objects made up of simpler kinds, such as Date
, matrix
, or dataframe
. You can informally think of a vector in R as a particular kind of list (one in which all the elements are of the same class), but the term list
has a special meaning in R: it is an ordered collection of objects, each of which may be of any class, unlike an R vector
, which can only contain objects of the same class. So as you become familiar with R, it is best to use the terms vector
and list
as R thinks of them, in order to avoid confusion.
A dataframe
in R is similar to a spreadsheet or a ‘table’ in a database. A dataframe consists of one or more columns that are vectors—in other words, they contain objects all of the same class; columns in a dataframe must also all be the same length. We may sometimes refer to a column as a “variable,” particularly if it has a name (that is, a column heading). A dataframe consists of one or more rows, which are sometimes referred to as “observations,” a term derived from statistical analysis in the experimental sciences.
As an example, when we first load into R the receipts
dataframe that we’ll be analyzing here, its first three rows will look like this:
## Date Title fl kr
## 1 1791-04-26 La bella pescatrice 483 10
## 2 1791-04-28 La bella pescatrice 418 42
## 3 1791-04-30 La bella pescatrice 325 11
The numbers at the far left are the row numbers, and the four columns have headings, which can also be referred to as “variables” (because the values in the column may vary from row to row).
As in most programming languages, a “function” takes input (“arguments”), does something to that input, and (usually) returns a result. In R, arguments in function calls are placed between parentheses immediately following the name of the function. For example, the function call head(receipts)
will return the first few rows of the receipts
dataframe that we’ll be using here (this is the function I used to create the little example above). Arguments to a function may be of any class, depending on how the function is defined: numbers, vectors, dataframes, even other functions. A function call in R may often include settings for “attributes”; an example of an attribute is the label for the x axis of a graph, as we’ll see later on. Arithmetic and logical “operators” in R include such symbols as +
, *
, and &
, which behave as we would expect. As we’ll see, complex “expressions” can be constructed out of these basic building blocks of functions, operators, and arguments.↩
“Base R” refers to those elements of R that are automatically available when you start the program, and do not require loading any additional packages.↩
The terms factor
and levels
come from R’s heritage as a statistically-oriented language. For example, in survey data it is common to use a set of standardized responses, such as those in the famous Likert scale: Strongly disagree
, Disagree
, Neither agree nor disagree
, Agree
, and Strongly agree
. These are the “levels” of the Likert scale. It turns out that this model works well for data like standardized titles of operas and plays, so we leave the class of the Title
column as factor
.↩
Because the number of performances over the 23 weeks is even (60), the median is the average of the receipts at positions 30 and 31 when the receipts are ranked from highest to lowest. Thus no particular opera had receipts of 175 fl 14 kr.↩
The full mantra is actually: “There should be one—and preferably only one—obvious way to do it. Although that way may not be obvious at first unless you’re Dutch” (Guido van Rossum, the creator of Python, is Dutch).↩
For an analysis of the capacity of the old Burgtheater and a detailed discussion of what the box-office receipts can tell us about the size of its audiences, see Dexter Edge, “Mozart’s Reception in Vienna, 1787–1791,” in: Wolfgang Amadè Mozart: Essays on his Life and his Music, edited by Stanley Sadie. 66–117. Oxford: Clarendon Press, 1996.↩
I learned the proper syntax from a set of presentation slides on dplyr
by Randall J. Pruim here.↩