Some time ago, my girlfriend and I watched the movie Spanglish, a comedy/drama/dramedy telling the story of a mother (Flor) and daughter (Cristina) who had emigrated from Mexico to start a new life in the USA. The movie is narrated from the view of the daughter, through her college application essay to Princeton University.

The movie was fine, but I was unsatisfied at the end: it ran over two hours, and the daughter’s essay (the entire narrative) seemed awfully long compared to a typical college essay. After months of sleepless nights, I decided to figure out just how long Cristina’s essay really was.


I use base R for the entire thing, except for the optional word count at the end. I found the script online (this ought to count as fair use). Data import is rather straightforward:

spang <- readLines('~/spanglish/texts/spanglish.txt')

Each element is a line.

head(spang)
## [1] ""                                                            
## [2] "FADE IN:"                                                    
## [3] ""                                                            
## [4] "INT. BEDROOM - MEDIUM CLOSE - MALE FORM - LATE AFTERNOON"    
## [5] ""                                                            
## [6] "A shape fills the lower portion of the screen. It is a man's"
tail(spang)
## [1] "\t\tjustice. I love her with all my"                                                            
## [2] "\t\theart."                                                                                     
## [3] ""                                                                                               
## [4] "\t\t\t\t        FADE OUT."                                                                      
## [5] ""                                                                                               
## [6] "All movie scripts and screenplays on «Screenplays for You» site are intended for fair use only."

There are an awful lot of blank lines. We’ll clean those up.

spang <- spang[spang!=''] # remove empty space lines
head(spang)
## [1] "FADE IN:"                                                    
## [2] "INT. BEDROOM - MEDIUM CLOSE - MALE FORM - LATE AFTERNOON"    
## [3] "A shape fills the lower portion of the screen. It is a man's"
## [4] "back..... a perfect back... good dark color, slim, muscular."
## [5] "LATIN MUSIC PLAYS... a song.... if you understood the words" 
## [6] "you would hear love confronted and considered in a very"
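One caveat worth keeping in mind: comparing against '' only drops lines that are exactly empty, so a line made up solely of tabs or spaces would survive. The sketch below uses a made-up vector (not lines from the actual script) to show the difference, with trimws() as one possible stricter filter. Whether the real file contains such lines is not something checked here.

```r
# Hypothetical lines, invented for illustration
x <- c("FADE IN:", "", "\t\t", "   ", "A shape fills the screen.")

kept_exact <- x[x != ""]            # drops only the truly empty string
kept_trim  <- x[nzchar(trimws(x))]  # also drops whitespace-only lines

print(length(kept_exact))  # 4
print(length(kept_trim))   # 2
```

This matters a little for the next step, because a whitespace-only line is trivially equal to its own toupper() and would look like an all-caps line.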

The next part is a little complicated. Each stretch of narration starts at a line containing NARR, which also catches the semi-frequent NARRATOR (CONT'D) lines. To find where it ends, we look for the index of the next line that is all caps, which usually signals a change in speaker:

upps <- spang == toupper(spang)  # TRUE for all-uppercase lines (usually a speaker)
narrs <- grep('NARR', spang)     # indices of the lines that start narration
narrs2 <- narrs + 1              # first line after each narrator cue
out <- character(length(narrs2)) # preallocate storage

for (ii in seq_along(narrs2)) {
  # For each narrator mention, find the next all-caps line;
  # idx is the line just before it, i.e. the last line of this narration block
  idx <- min(which(upps[narrs2[ii]:length(upps)])) + narrs[ii] - 1
  block <- spang[narrs2[ii]:idx]
  # Keep only lines with at least one tab; untabbed lines are usually actor directions
  out[ii] <- paste(block[grepl('\t', block)], collapse = ' ')
}
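To make the index arithmetic concrete, here is a small sketch that runs the same logic on a six-line toy script. The `toy` vector and its contents are invented for illustration and are not taken from the Spanglish screenplay.

```r
# Hypothetical miniature script (blank lines already removed)
toy <- c(
  "\t\t\t\tNARRATOR",
  "\t\tTo the Dean of Admissions,",
  "She looks up from the page.",
  "\t\tPrinceton University.",
  "\t\t\t\tJOHN",
  "\t\tHello."
)

upps  <- toy == toupper(toy)  # TRUE for lines 1 and 5 (the speaker cues)
narrs <- grep("NARR", toy)    # 1
start <- narrs + 1            # first line after the cue

# Relative position of the next all-caps line, converted to the absolute
# index of the line just before it
idx <- min(which(upps[start:length(upps)])) + narrs - 1

block <- toy[start:idx]
essay <- paste(block[grepl("\t", block)], collapse = " ")
essay <- gsub("\t", "", essay)

print(idx)    # 4
print(essay)  # "To the Dean of Admissions, Princeton University."
```

The untabbed action line in the middle is dropped, and JOHN's line never enters the block because the search stops at his all-caps cue.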

And we’ll collapse the lines together with paste().

out <- paste(out, sep = ' ', collapse = ' ')
out <- gsub('\t', '', out) # remove tabs, now that we're done with them

The last problematic part is that parentheticals are used to refer to happenings in the scene, and should not count toward the essay word count.

out <- gsub( ' *\\(.*?\\) *', '', out) # remove all parentheticals
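One side effect of this pattern is worth a quick check: it swallows the spaces on both sides of a parenthetical and puts nothing back, so two neighboring words can be joined into one. The sketch below uses a made-up sentence to show this, next to a variant that substitutes a single space. The variant is an illustration only, not the call that produced the counts in this post, and whether the difference matters for the real script depends on how its parentheticals are laid out.

```r
# Made-up sentence, not from the screenplay
x <- "I love her (beat) with all my heart."

# Same pattern as above: surrounding spaces are removed along with the parenthetical
merged <- gsub(" *\\(.*?\\) *", "", x)
# Illustrative variant: leave one space where the parenthetical was
spaced <- gsub(" *\\(.*?\\) *", " ", x)

print(merged)  # "I love herwith all my heart."
print(spaced)  # "I love her with all my heart."

# "herwith" counts as one word, so the two versions differ by one
n_words <- lengths(strsplit(c(merged, spaced), " +"))
print(n_words)  # 6 7
```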

Now for the word counts! One is from the qdap package, and the other is a solution from this Stack Overflow answer:

qdap::wc(out, digit.remove = FALSE)
## [1] 1219
vapply(strsplit(out, "\\W+"), length, integer(1))
## [1] 1242

An online word counter gives me 1228 words, and OpenOffice gives me 1222. I think the vapply(...) method is getting hung up on contractions, and the small deviations amongst the others might be differences in handling Arabic numerals. Just glancing over the output, I think this method is rather successful in parsing out only the narrator’s lines.
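The contraction effect is easy to reproduce on a made-up sentence. Splitting on "\\W+" treats the apostrophe as a separator, so each contraction becomes two tokens. Splitting on whitespace is shown only for comparison. It is not what qdap or the other counters necessarily do internally.

```r
# Made-up sentence with two contractions
s <- "I don't think it's too long."

tokens <- strsplit(s, "\\W+")[[1]]
print(tokens)  # "I" "don" "t" "think" "it" "s" "too" "long"

n_nonword <- vapply(strsplit(s, "\\W+"), length, integer(1))
n_space   <- vapply(strsplit(s, "\\s+"), length, integer(1))

print(n_nonword)  # 8
print(n_space)    # 6
```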


How does this compare to college application essay lengths? The Common Application did not have a word limit until recently, but the program did not exist at the time of the film (it started in 2007, while the movie is from 2004). Princeton University does require a 250-650 word essay in addition to the Common App essay, but this is also likely to be more recent. However, this forum post from 2004 suggests that the max essay length around that time was 1000 words.

zantedeschia:

I wrote
--Tell us about yourself (680 words)
--Topic of choice (370 words)
--Significant activity? Forgot how they worded it. (370ish words)
...

EncomiumII:

zante-you don't think being +120, +120, and +180 words over the limit might ...uhhh...bug people? 

So our favorite narrator overshot by roughly 220 words. Would this have affected her chances of being admitted? Does it matter?
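For the record, the arithmetic behind that figure, using the four counts reported earlier and taking the 1000-word maximum from the forum post as the assumed limit:

```r
# Word counts reported above, by tool
counts <- c(qdap = 1219, vapply = 1242, online = 1228, openoffice = 1222)
limit  <- 1000  # assumed maximum, per the 2004 forum post

overshoot <- counts - limit
print(overshoot)         # 219 242 228 222
print(range(overshoot))  # 219 242
```

Depending on the counter, the overshoot lands somewhere between 219 and 242 words.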

To wrap up, this was a fun little excursion into text/movie script parsing, and confirmed my suspicions that Cristina’s essay was far too long.