FiveThirtyEight has a dataset (https://fivethirtyeight.com/features/complete-catalog-curses-deaths-quentin-tarantino-films/) of every swear and death in seven of Quentin Tarantino’s films, including the time in each film at which they occur. Let’s see if the frequency of swearing (there’s more data on swears than on deaths) has any relationship to time elapsed within these films.
First, some data prep. What we’re given to begin with is a dataset with the movie, the ‘type’ of event (word or death; we select for word only), the word itself, and the timestamp at which it occurred, in minutes. Included below is also a summary, in which we can see that some movies have a lot more swears than others (Pulp Fiction has 469, Kill Bill: Vol. 1 only 57).
##            movie type     word minutes_in
## 1 Reservoir Dogs word     dick       0.40
## 2 Reservoir Dogs word    dicks       0.43
## 3 Reservoir Dogs word   fucked       0.55
## 4 Reservoir Dogs word  fucking       0.61
## 5 Reservoir Dogs word bullshit       0.61
## 6 Reservoir Dogs word     fuck       0.66
##                  movie        type          word        minutes_in
##  Django Unchained   :262   death:   0   fucking:407   Min.   :  0.40
##  Inglorious Basterds: 58   word :1704   shit   :221   1st Qu.: 28.64
##  Jackie Brown       :368                fuck   :213   Median : 67.29
##  Kill Bill: Vol. 1  : 57                n-word :179   Mean   : 69.63
##  Kill Bill: Vol. 2  : 69                ass    :140   3rd Qu.:111.91
##  Pulp Fiction       :469                goddamn:113   Max.   :160.28
##  Reservoir Dogs     :421                (Other):431
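For reference, the filtering step might look something like the sketch below. It is a guess at the approach rather than the code actually used. The data frame `toy` is a small stand-in built from the six rows printed above plus one made-up death row (both the name and that extra row are assumptions for illustration); with the real data you would read in the FiveThirtyEight file instead.

```r
# Stand-in for the real data: the six rows printed above, plus one
# hypothetical "death" row so the filter has something to drop.
toy <- data.frame(
  movie      = "Reservoir Dogs",
  type       = c("word", "word", "word", "word", "word", "word", "death"),
  word       = c("dick", "dicks", "fucked", "fucking", "bullshit", "fuck", NA),
  minutes_in = c(0.40, 0.43, 0.55, 0.61, 0.61, 0.66, 5.00),
  stringsAsFactors = TRUE
)

# Keep only the swear words; the death row is dropped.
swears <- toy[toy$type == "word", ]

head(swears)
summary(swears)
```

Because `type` is a factor, the unused `death` level survives the subset, which is why a summary like the one above can still list `death` with a count of 0.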
Data Prep
We want to get this into a form we can use. We’ll do that by tallying the swears that occur in each minute of film. We also need to keep track of the number of movies that are still running at each minute: not all of the films have the same length, so a raw tally would be biased towards the earlier minutes, when more movies are contributing swears.
What we’ll end up with for the analysis is the number of swears in each minute, divided by the number of movies they could have come from.
## # A tibble: 6 x 3
##   minute swears_freq_adjusted  time
##   <fct>                 <dbl> <dbl>
## 1 (0,1]                  1.29     1
## 2 (1,2]                  2.57     2
## 3 (2,3]                  2.57     3
## 4 (3,4]                  1.57     4
## 5 (4,5]                  1.71     5
## 6 (5,6]                  1.71     6
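The binning and adjustment described above could be sketched roughly as follows, in base R. This is not necessarily how the table above was produced. The three movies and their timestamps are invented, and treating a movie's last timestamp as its runtime is an assumption.

```r
# Toy stand-in: three hypothetical movies of different lengths.
swears <- data.frame(
  movie      = c("A", "A", "A", "B", "B", "C", "C", "C"),
  minutes_in = c(0.4, 0.6, 2.5, 0.9, 1.2, 0.2, 1.7, 3.1)
)

# Assumption: a movie's runtime is approximated by its last timestamp.
runtime <- tapply(swears$minutes_in, swears$movie, max)

# One-minute bins: (0,1], (1,2], ... up to the longest movie.
n_min  <- ceiling(max(runtime))
minute <- cut(swears$minutes_in, breaks = 0:n_min)

# Raw tally of swears in each one-minute bin.
tally <- as.vector(table(minute))

# Number of movies still running during each minute, i.e. those whose
# runtime extends past the start of that bin.
time <- seq_len(n_min)
movies_running <- sapply(time, function(m) sum(runtime > m - 1))

per_minute <- data.frame(
  minute               = factor(levels(minute), levels = levels(minute)),
  swears_freq_adjusted = tally / movies_running,
  time                 = time
)
per_minute
```

In this toy example the last bin is fed by only one movie, so its tally is divided by 1 rather than 3. That is the correction that keeps late minutes from looking artificially quiet.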
Analysis
Now that that’s done, let’s do some analysis. Below is a scatterplot of the data, along with a linear regression:

## (Intercept) time
## 1.8519056107 -0.0009997117
As we can see, the linear regression didn’t capture much of a trend. With a slope of about -0.001, it predicts roughly 1.85 swears per minute per movie at the start and still about 1.7 at the 160-minute mark. Let’s now test the assumptions.
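A fit like this one can be sketched with `lm()`. To keep the example self-contained it uses only the six per-minute rows printed earlier, so its coefficients will not match the full-data ones above. The helper `plot_fit` is a name made up for this sketch.

```r
# The six per-minute rows shown earlier (a tiny subset of the data).
per_minute <- data.frame(
  time                 = 1:6,
  swears_freq_adjusted = c(1.29, 2.57, 2.57, 1.57, 1.71, 1.71)
)

# Simple linear regression of adjusted swear frequency on time.
fit <- lm(swears_freq_adjusted ~ time, data = per_minute)
coef(fit)

# Scatterplot with the fitted line overlaid; wrapped in a function so
# nothing is drawn until it is called.
plot_fit <- function(data, fit) {
  plot(data$time, data$swears_freq_adjusted,
       xlab = "Minute of film", ylab = "Swears per minute per movie")
  abline(fit, col = "red")
}
```

Calling `plot_fit(per_minute, fit)` draws the scatterplot and regression line.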

It looks like there is more variation on the positive end of the residuals than on the negative end. That makes some sense when you think about how there are going to be bouts of dialogue (and swearing) followed by stretches with no speaking at all, and a minute can’t have fewer than zero swears. Let’s take a histogram of the residuals and see what that looks like.

Yeah, it’s a bit right-skewed. Our assumption of normally distributed residuals doesn’t seem to hold. Let’s also take a look at a normal QQ-plot below:
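The three residual checks used here (residuals against fitted values, a histogram, and a normal QQ-plot) can be sketched in base R as below. The data are again just the six per-minute rows printed earlier, so the pictures will differ from the full-data ones. `plot_diagnostics` is a hypothetical helper name, and the skewness figure is an extra numeric check, not something taken from the original analysis.

```r
# The six per-minute rows shown earlier (a tiny subset of the data).
per_minute <- data.frame(
  time                 = 1:6,
  swears_freq_adjusted = c(1.29, 2.57, 2.57, 1.57, 1.71, 1.71)
)
fit <- lm(swears_freq_adjusted ~ time, data = per_minute)
res <- resid(fit)

# Sample skewness of the residuals: a positive value means a longer
# right tail, which is what "right-skewed" refers to.
skewness <- mean((res - mean(res))^3) / mean((res - mean(res))^2)^1.5
skewness

# Draw the three diagnostic plots side by side.
plot_diagnostics <- function(fit) {
  res <- resid(fit)
  old <- par(mfrow = c(1, 3))
  on.exit(par(old))
  plot(fitted(fit), res, xlab = "Fitted", ylab = "Residual",
       main = "Residuals vs fitted")
  abline(h = 0, lty = 2)
  hist(res, main = "Histogram of residuals", xlab = "Residual")
  qqnorm(res)
  qqline(res)
}
```

Calling `plot_diagnostics(fit)` produces the plots. Points bending away from the `qqline()` reference line are what a curved QQ-plot looks like.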

There’s a pretty definite curve to it. I’d say that linear regression likely isn’t the best choice for this analysis. Looking back at the original regression and scatterplot, I can kinda see a potential curve to it. If there is a trend, I suspect adding a squared term to the formula might help capture any pattern there.
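Adding a squared term is a small change to the model formula. The sketch below shows one way to write it and compare it to the straight-line fit. It uses only the six per-minute rows printed earlier, so it says nothing about what the full data would show.

```r
# The six per-minute rows shown earlier (a tiny subset of the data).
per_minute <- data.frame(
  time                 = 1:6,
  swears_freq_adjusted = c(1.29, 2.57, 2.57, 1.57, 1.71, 1.71)
)

# Straight-line fit and the same fit with a squared term added.
# I() makes ^ mean arithmetic squaring rather than formula syntax.
fit_lin  <- lm(swears_freq_adjusted ~ time, data = per_minute)
fit_quad <- lm(swears_freq_adjusted ~ time + I(time^2), data = per_minute)

coef(fit_quad)

# F-test comparing the two nested models: does the squared term
# explain noticeably more variation than the line alone?
anova(fit_lin, fit_quad)
```

A squared term can pick up curvature in the trend, but it does not by itself do anything about the right-skewed residuals, so the same diagnostics would be worth repeating on the new fit.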