Tarantino Expletives: A Simple Linear Regression

FiveThirtyEight has a dataset (https://fivethirtyeight.com/features/complete-catalog-curses-deaths-quentin-tarantino-films/) of every swear and death in seven of Quentin Tarantino’s films, including the time in each film at which it occurs. Let’s see if the frequency of swearing (there’s more data on swears than on deaths) has any relationship to time elapsed within these films.

First, some data prep. What we’re given to begin with is a dataset of the movie, the ‘type’ of data (word or death; we select for word only), the word itself, and the timestamp at which it occurred, in minutes. Included below is also a summary, in which we can see that some movies have a lot more swears than others.

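#Load the packages used below (group_by/summarise/%>% and ggplot)
library(dplyr)
library(ggplot2)

#Import data (stringsAsFactors = TRUE so summary() shows counts per level, as in the output below)
raw <- read.csv('https://raw.githubusercontent.com/davidblumenstiel/data/master/tarantino/tarantino.csv', stringsAsFactors = TRUE)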

df <- subset(raw, type == "word")

head(df)

##            movie type     word minutes_in
## 1 Reservoir Dogs word     dick       0.40
## 2 Reservoir Dogs word    dicks       0.43
## 3 Reservoir Dogs word   fucked       0.55
## 4 Reservoir Dogs word  fucking       0.61
## 5 Reservoir Dogs word bullshit       0.61
## 6 Reservoir Dogs word     fuck       0.66

summary(df)

##                  movie        type           word       minutes_in    
##  Django Unchained   :262   death:   0   fucking:407   Min.   :  0.40  
##  Inglorious Basterds: 58   word :1704   shit   :221   1st Qu.: 28.64  
##  Jackie Brown       :368                fuck   :213   Median : 67.29  
##  Kill Bill: Vol. 1  : 57                n-word :179   Mean   : 69.63  
##  Kill Bill: Vol. 2  : 69                ass    :140   3rd Qu.:111.91  
##  Pulp Fiction       :469                goddamn:113   Max.   :160.28  
##  Reservoir Dogs     :421                (Other):431

Data Prep

We want to get this into a form we can use. We’ll do that by taking tallies of the swears that occur in each minute of film. We also need to keep track of the number of movies that are still running at each minute, because not all of them have the same length; without that, the tallies would be biased towards earlier minutes, when more movies are contributing swears.

What we’ll end up with for the analysis is a frequency of swears per minute, divided by the number of movies they could have come from.
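To make the adjustment concrete, here’s a minimal sketch on made-up numbers. The runtimes and timestamps below are hypothetical, not taken from the dataset; the idea is just that each minute’s swear count gets divided by however many movies are still running at that minute.

```r
# Hypothetical runtimes (mins) for three made-up movies
runtimes <- c(A = 3, B = 5, C = 5)

# Hypothetical swear timestamps (mins), pooled across the three movies
minutes_in <- c(0.5, 0.7, 1.2, 2.9, 3.5, 4.1, 4.8)

# Bin timestamps into 1-minute intervals: (0,1], (1,2], ...
minute <- cut(minutes_in, breaks = 0:5, right = TRUE)

# Raw swear count per minute bin (table() keeps empty bins)
counts <- as.vector(table(minute))

# Number of movies still running in each bin: a movie counts
# if its runtime reaches the upper edge of the bin
running <- sapply(1:5, function(m) sum(runtimes >= m))

# Swears per minute per running movie
adjusted <- counts / running
data.frame(bin = levels(minute), counts, running, adjusted)
```

In the last two bins only two movies are still running, so the same raw count is worth more per movie than it would be earlier on.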

#Creates groups for the data for each min elapsed
df$minute <- cut(df$minutes_in, breaks = seq(0, 162, by = 1), right = TRUE)


#Counts the number of movies still running by X time in mins.
df$nmovies <- NA

for (i in seq_len(nrow(df))) {
  
  if(df$minutes_in[i] <= 100) {
    df$nmovies[i] <- 7
  
  } else if(df$minutes_in[i] <= 112) {
    df$nmovies[i] <- 6
    
  } else if(df$minutes_in[i] <= 153) {
    df$nmovies[i] <- 5
    
  } else if(df$minutes_in[i] <= 153) {
    #NOTE: this branch can never run, because it repeats the 153 cutoff above,
    #so nmovies is never set to 4. One of the two cutoffs should be checked against the runtimes.
    df$nmovies[i] <- 4
    
  } else if(df$minutes_in[i] <= 160) {
    df$nmovies[i] <- 3
    
  } else if(df$minutes_in[i] <= 165) {
    df$nmovies[i] <- 2
    
  } else if(df$minutes_in[i] <= 178) {
    df$nmovies[i] <- 1
    
  }
  
}

#Makes a dataset broken down by the minute group, with the frequency of swears
swears <- group_by(df, minute) %>%
  summarise(swears_freq_adjusted = n()/max(nmovies)) %>%  #The number of swears over the number of movies for any given minute
  ungroup()

# Makes time into a continuous variable (the row index of each minute group)
swears$time <- as.numeric(seq_len(nrow(swears)))




head(swears)

## # A tibble: 6 x 3
##   minute swears_freq_adjusted  time
##   <fct>                 <dbl> <dbl>
## 1 (0,1]                  1.29     1
## 2 (1,2]                  2.57     2
## 3 (2,3]                  2.57     3
## 4 (3,4]                  1.57     4
## 5 (4,5]                  1.71     5
## 6 (5,6]                  1.71     6
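One thing to watch for here: group_by() drops minute bins that contain no swears, so numbering the rows 1, 2, 3, ... only matches the real minute if every bin has at least one swear in it. Below is a sketch of an alternative that takes the time from the bin itself. It uses base R only, and the `toy` data frame and its timestamps are made up for illustration.

```r
# Hypothetical timestamps with no swears between minutes 2 and 4
minutes_in <- c(0.4, 0.9, 1.5, 4.2, 4.7)
minute <- cut(minutes_in, breaks = 0:5, right = TRUE)

# Count swears only for bins that actually occur (mimics dropping empty groups)
toy <- as.data.frame(table(droplevels(minute)), stringsAsFactors = FALSE)
names(toy) <- c("minute", "n")

# Row numbers run 1, 2, 3 even though the third bin is really minute 5
toy$row_time <- seq_len(nrow(toy))

# Position of each bin label among the full set of bins gives the true minute
toy$time <- match(toy$minute, levels(minute))
toy
```

With the real data, the equivalent would be something like `as.integer(swears$minute)`, since the factor levels from cut() are already in minute order.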

Analysis

Now that that’s done, let’s do some analysis. Below is a scatterplot of the data, along with a linear regression line.

model <- lm(swears_freq_adjusted~time, data = swears)

ggplot(data = swears, aes(time, swears_freq_adjusted)) + 
  geom_point() +
  geom_abline(intercept = model$coefficients[[1]], slope = model$coefficients[[2]], color = 'red')

model$coefficients

##   (Intercept)          time 
##  1.8519056107 -0.0009997117

As we can see, the linear regression didn’t capture much of a correlation. With a slope of about -0.001, it pretty much predicts a little under 2 swears per minute per movie throughout. Let’s now test the assumptions.
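To put a number on "not much of a correlation", one option is to pull the R-squared and the slope’s p-value out of summary(). Here’s a minimal sketch on simulated data; the `sim` data frame is made up to resemble a flat trend and is not the Tarantino data.

```r
set.seed(1)

# Simulated stand-in: a flat trend of about 2 with noise
sim <- data.frame(time = 1:160)
sim$freq <- 2 + rnorm(160, sd = 1)

fit <- lm(freq ~ time, data = sim)
s <- summary(fit)

# Proportion of variance explained by time
s$r.squared

# Estimate, standard error, t value and p-value for the slope
s$coefficients["time", ]
```

For the real model, the same two lines would be run on `summary(model)`.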

ggplot(fortify(model), aes(x = .fitted, y = .resid)) + geom_point() + geom_hline(yintercept = 0)

It looks like there is more variation on the positive end of the residuals than the negative end. That makes some sense when you think about how there are going to be bouts of dialogue (and swearing) followed by stretches with no speaking at all. Let’s take a histogram of the residuals and see what that trend looks like.

hist(model$residuals)

Yeah, it’s a bit right-skewed. Our assumption of normally distributed residuals doesn’t seem to hold. Let’s also take a look at a normal QQ-plot below:

#The normality assumption is about the residuals, so those are what we plot
qqnorm(model$residuals)
qqline(model$residuals)
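If you’d rather have a number to go with the plots, a Shapiro-Wilk test on the residuals is one common option. This is a hedged sketch on simulated right-skewed data; again, `sim` is made up and is not the film data.

```r
set.seed(2)

# Simulated stand-in with right-skewed noise around a flat trend
sim <- data.frame(time = 1:160)
sim$freq <- 1 + rexp(160, rate = 1)

fit <- lm(freq ~ time, data = sim)

# Null hypothesis: the residuals come from a normal distribution.
# A small p-value is evidence against normality.
sw <- shapiro.test(residuals(fit))
sw$p.value
```

With the real model this would be `shapiro.test(model$residuals)`.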

There’s a pretty definite curve to it. I’d say that linear regression likely isn’t the best choice for analysis. Looking back to the original regression and scatterplot, I can kinda see a potential curve to it. If there is a trend, I suspect adding a squared term to the formula might help capture any pattern there.
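Here’s roughly what that would look like. This is a sketch on simulated data with a curve built in (`sim` is hypothetical); with the real data you’d swap in `swears`, `swears_freq_adjusted` and `time`.

```r
set.seed(3)

# Simulated stand-in with a gentle curve over time
sim <- data.frame(time = 1:160)
sim$freq <- 2.5 - 0.02 * sim$time + 0.0001 * sim$time^2 + rnorm(160, sd = 0.5)

# Straight-line fit vs. fit with an added squared term
linear    <- lm(freq ~ time, data = sim)
quadratic <- lm(freq ~ time + I(time^2), data = sim)

# F-test for whether the squared term improves on the straight line
anova(linear, quadratic)

# Adjusted R-squared for each model
c(linear = summary(linear)$adj.r.squared,
  quadratic = summary(quadratic)$adj.r.squared)
```

The I() wrapper is needed so that `^` is treated as arithmetic rather than as formula syntax.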

David Blumenstiel

11/2/2020
