FiveThirtyEight has a dataset (https://fivethirtyeight.com/features/complete-catalog-curses-deaths-quentin-tarantino-films/) of all the swears and deaths that occur in seven of Quentin Tarantino’s films, including the time in each film at which they occur. There is more data on swears than on deaths, so let’s see whether the frequency of swearing has any relationship to the time elapsed within these films.

First, some data prep. What we’re given to begin with is a dataset with the movie, the ‘type’ of event (word or death; we keep only word), the word itself, and the timestamp at which it occurred, in minutes. Below are the first few rows, followed by a summary, in which we can see that some movies have a lot more swears than others (469 in Pulp Fiction versus 57 in Kill Bill: Vol. 1).

##            movie type     word minutes_in
## 1 Reservoir Dogs word     dick       0.40
## 2 Reservoir Dogs word    dicks       0.43
## 3 Reservoir Dogs word   fucked       0.55
## 4 Reservoir Dogs word  fucking       0.61
## 5 Reservoir Dogs word bullshit       0.61
## 6 Reservoir Dogs word     fuck       0.66
##                  movie        type           word       minutes_in    
##  Django Unchained   :262   death:   0   fucking:407   Min.   :  0.40  
##  Inglorious Basterds: 58   word :1704   shit   :221   1st Qu.: 28.64  
##  Jackie Brown       :368                fuck   :213   Median : 67.29  
##  Kill Bill: Vol. 1  : 57                n-word :179   Mean   : 69.63  
##  Kill Bill: Vol. 2  : 69                ass    :140   3rd Qu.:111.91  
##  Pulp Fiction       :469                goddamn:113   Max.   :160.28  
##  Reservoir Dogs     :421                (Other):431

Data Prep

We want to get this into a form we can use. We’ll do that by tallying the swears that occur in each minute of film. We also need to keep track of the number of movies still running at each minute: the films differ in length, so raw tallies would be biased toward the earlier minutes, when more movies are contributing swears.
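The tallying step can be sketched roughly as follows. This is a Python illustration, not the code behind the output in this post (which looks like R output). The function name `swears_per_minute` is hypothetical, and it assumes a movie counts as "running" up to the minute of its last recorded swear, since the dataset shown here has no explicit runtime column.

```python
import math
from collections import Counter, defaultdict

def swears_per_minute(records):
    """records: iterable of (movie, minutes_in) pairs, swear words only.

    Returns a list of (minute, total_swears, movies_running, rate) tuples,
    where rate = total_swears / movies_running.
    """
    counts = Counter()              # minute -> swears across all movies
    last_minute = defaultdict(int)  # movie -> last minute containing a swear
    for movie, minutes_in in records:
        minute = math.floor(minutes_in)
        counts[minute] += 1
        last_minute[movie] = max(last_minute[movie], minute)

    rows = []
    for minute in range(max(last_minute.values()) + 1):
        # ASSUMPTION: a movie is "running" until its last recorded swear.
        running = sum(1 for end in last_minute.values() if end >= minute)
        rows.append((minute, counts[minute], running, counts[minute] / running))
    return rows

# Toy values for illustration only; these are not rows from the dataset.
toy = [("Movie A", 0.40), ("Movie A", 0.43), ("Movie B", 0.20), ("Movie B", 2.75)]
for row in swears_per_minute(toy):
    print(row)
```

Dividing by the number of running movies is what turns a raw tally into "swears per minute per movie", so a minute reached by only one film is not penalised for the films that have already ended.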

Analysis

Now that that’s done, let’s do some analysis. Below is a scatterplot of the data, along with the coefficients of a linear regression of swears per minute per movie on time:

##   (Intercept)          time 
##  1.8519056107 -0.0009997117
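For reference, a simple linear regression like this one can be written out by hand. The sketch below is in Python rather than the R that presumably produced the output above, and `fit_line` is a hypothetical helper name; it implements ordinary least squares for one predictor.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Fitted values implied by the coefficients reported above.
intercept, slope = 1.8519056107, -0.0009997117
for minute in (0, 80, 160):
    print(minute, round(intercept + slope * minute, 3))
```

With the reported coefficients, the fitted rate only drifts from about 1.85 swears per minute per movie at the start to about 1.69 at minute 160.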

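As we can see, the linear regression didn’t capture much of a trend. The slope is about -0.001, so the fitted line is nearly flat: it predicts a little under 2 swears per minute per movie throughout, drifting from about 1.85 at the start to about 1.69 at minute 160. Let’s now check the assumptions of the regression.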

It looks like there is more variation on the positive side of the residuals than on the negative side. That makes some sense when you think about how films alternate between bouts of dialogue (and swearing) and stretches with little or no speaking. Let’s plot a histogram of the residuals and see what their distribution looks like.
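As a numeric companion to the histogram, the residuals and their skewness can be computed directly. This is a Python sketch with hypothetical helper names (`residuals`, `skewness`). It uses the moment-based skewness m3 / m2^1.5, which is one common definition and not necessarily the one a given statistics package reports.

```python
def residuals(x, y, intercept, slope):
    """Observed value minus fitted value for each point."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

def skewness(values):
    """Moment-based skewness: positive values indicate a longer right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Toy values for illustration only: mostly quiet minutes and one burst.
print(skewness([0, 0, 0, 0, 10]))
```

A clearly positive skewness for the regression residuals would match the impression of more spread above the fitted line than below it.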

Yeah, it’s a bit right-skewed. Our assumption of normally distributed residuals doesn’t seem to hold. Let’s also take a look at a normal QQ-plot below:

There’s a pretty definite curve to it. I’d say that linear regression likely isn’t the best choice for this analysis. Looking back at the original regression and scatterplot, I can kinda see a potential curve in the data. If there is a trend, I suspect adding a squared term to the formula might help capture it.
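Adding a squared term means fitting rate = b0 + b1*time + b2*time^2, which is still a linear least-squares problem in the coefficients. A minimal Python sketch follows, assuming the same per-minute data as before; `fit_quadratic` is a hypothetical name, and it solves the normal equations directly, which is fine for illustration but less numerically robust than what a statistics package does.

```python
def fit_quadratic(t, y):
    """Least-squares fit of y = b0 + b1*t + b2*t**2; returns [b0, b1, b2]."""
    rows = [[1.0, ti, ti * ti] for ti in t]
    # Normal equations (X'X) b = X'y, stored as an augmented 3x4 matrix.
    a = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(3)]
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda k: abs(a[k][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for k in range(col + 1, 3):
            factor = a[k][col] / a[col][col]
            for c in range(col, 4):
                a[k][c] -= factor * a[col][c]
    # Back substitution.
    b = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        tail = sum(a[i][j] * b[j] for j in range(i + 1, 3))
        b[i] = (a[i][3] - tail) / a[i][i]
    return b

# Toy check on points that lie exactly on y = 1 + 2t - 0.5t^2.
print(fit_quadratic([0, 1, 2, 3, 4], [1.0, 2.5, 3.0, 2.5, 1.0]))
```

If the squared coefficient comes out clearly different from zero on the real data, that would support the suspicion of a curve; the skewed residuals would still need checking afterwards.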