In this project we will explore whether Film Editing is the best predictor of Best Picture. Chris and Jason Joseph obtained, parsed and transformed text file from awards database to csv file. This data will be used to do exploratory analysis below.

Is Film Editing the best predictor of Best Picture?

require(tidyr)
require(dplyr)
require(knitr)
require(stringr)
require(ggplot2)

Load csv File

# Load Master File
academy_awards <- read.csv(file="https://raw.githubusercontent.com/mkds/IS607_Project3/gh-pages/Data/project_view_1981_2014.csv", stringsAsFactors = FALSE)
kable(head(academy_awards))
Year Category Nominee Won
1981 ACTOR IN A LEADING ROLE Reds no
1981 ACTOR IN A LEADING ROLE On Golden Pond yes
1981 ACTOR IN A LEADING ROLE Atlantic City no
1981 ACTOR IN A LEADING ROLE Arthur no
1981 ACTOR IN A LEADING ROLE Absence of Malice no
1981 ACTOR IN A SUPPORTING ROLE Only When I Laugh no

I will separate Best Picture into its own data frame and then join with other like categories below.

# Separate Best Picture won into different data frame
best_picture <- academy_awards %>% 
                select(Year, Category, Nominee, Won) %>%
                filter(Won == "yes", 
                       Category %in% c("BEST PICTURE", "BEST MOTION PICTURE")) %>%
                arrange(Year)
kable(head(best_picture))
Year Category Nominee Won
1981 BEST PICTURE Chariots of Fire yes
1982 BEST PICTURE Gandhi yes
1983 BEST PICTURE Terms of Endearment yes
1984 BEST PICTURE Amadeus yes
1985 BEST PICTURE Out of Africa yes
1986 BEST PICTURE Platoon yes

Create another data frame that contains all Categories except for Best Picture

# Select Categories and other columns into different data frame
#separate(Category,c("Category","CategoryType"), sep=" IN A ") %>%
categories_won <- academy_awards %>% select(Year, Category, Nominee, Won) %>%
                  filter(Won == "yes", 
                         !(Category %in% c("BEST PICTURE"))) %>%
                  arrange(Year)
kable(head(categories_won))
Year Category Nominee Won
1981 ACTOR IN A LEADING ROLE On Golden Pond yes
1981 ACTOR IN A SUPPORTING ROLE Arthur yes
1981 ACTRESS IN A LEADING ROLE On Golden Pond yes
1981 ACTRESS IN A SUPPORTING ROLE Reds yes
1981 CINEMATOGRAPHY Reds yes
1981 COSTUME DESIGN Chariots of Fire yes

Now we have 2 data frames, Best Picture and Others. Let us join them based on Nominees

# Join the data frames based on the Best Pictures won
category_picture_joined <- categories_won %>% inner_join(best_picture, by = "Nominee")
colnames(category_picture_joined)[2] <- "Category"
kable(head(category_picture_joined))
Year.x Category Nominee Won.x Year.y Category.y Won.y
1981 COSTUME DESIGN Chariots of Fire yes 1981 BEST PICTURE yes
1982 CINEMATOGRAPHY Gandhi yes 1982 BEST PICTURE yes
1982 COSTUME DESIGN Gandhi yes 1982 BEST PICTURE yes
1982 DIRECTING Gandhi yes 1982 BEST PICTURE yes
1982 FILM EDITING Gandhi yes 1982 BEST PICTURE yes
1983 DIRECTING Terms of Endearment yes 1983 BEST PICTURE yes

Now find frequencies of Categories that also won Best Picture

# Find the frequence and percent of each category 
category_freq <- category_picture_joined %>%  
                 group_by(Category) %>% summarise(count = n()) %>%
                 mutate(percent=round(count/sum(count)*100,digits=2)) %>% arrange(desc(percent))
kable(category_freq)
Category count percent
DIRECTING 25 37.31
FILM EDITING 16 23.88
CINEMATOGRAPHY 11 16.42
COSTUME DESIGN 11 16.42
SOUND MIXING 3 4.48
SOUND EDITING 1 1.49

Plot to visually see which Category relates to Best Picture.

# Plot it
g <- ggplot(category_freq, aes(x=Category, y=percent, fill=Category)) 
g + geom_bar(stat = "identity") + 
    coord_flip() + 
    ggtitle("Best Film Winning Other Categories") +
    xlab("Categories") +
    ylab("Frequency(%)")

Conclusion: Surprisingly, Best Picture also won DIRECTING category 25 times(37.31%) and FILM EDITING only 16 times! This shows that Film Editing is not the best predictor of Best Picture.

Is there a coorelation between winning Categories and Best Picture Nominees that did not win?

Let’s start by separating non-winning Best Picture Nominees.

not_so_best_picture <- academy_awards %>% 
                select(Year, Category, Nominee, Won) %>%
                filter(Won == "no", 
                       Category %in% c("BEST PICTURE")) %>%
                arrange(Year)
kable(head(not_so_best_picture))
Year Category Nominee Won
1981 BEST PICTURE Atlantic City no
1981 BEST PICTURE On Golden Pond no
1981 BEST PICTURE Raiders of the Lost Ark no
1981 BEST PICTURE Reds no
1982 BEST PICTURE E.T. The Extra-Terrestrial no
1982 BEST PICTURE Missing no

Join with categories that it won

category_picture_joined_2 <- categories_won %>% inner_join(not_so_best_picture, by = "Nominee")
colnames(category_picture_joined_2)[2] <- "Category"
kable(head(category_picture_joined_2))
Year.x Category Nominee Won.x Year.y Category.y Won.y
1981 CINEMATOGRAPHY Reds yes 1981 BEST PICTURE no
1981 DIRECTING Reds yes 1981 BEST PICTURE no
1981 FILM EDITING Raiders of the Lost Ark yes 1981 BEST PICTURE no
1983 FILM EDITING The Right Stuff yes 1983 BEST PICTURE no
1984 CINEMATOGRAPHY The Killing Fields yes 1984 BEST PICTURE no
1984 FILM EDITING The Killing Fields yes 1984 BEST PICTURE no
category_freq_2 <- category_picture_joined_2 %>%  
                 group_by(Category) %>% summarise(count = n()) %>%
                 mutate(percent=round(count/sum(count)*100,digits=2)) %>% arrange(desc(percent))
kable(category_freq_2)
Category count percent
CINEMATOGRAPHY 15 27.27
FILM EDITING 12 21.82
DIRECTING 8 14.55
SOUND EDITING 8 14.55
COSTUME DESIGN 6 10.91
SOUND MIXING 6 10.91
g_2 <- ggplot(category_freq_2, aes(x=Category, y=percent, fill=Category)) 
g_2 + geom_bar(stat = "identity") + 
      coord_flip() + 
      scale_fill_brewer() +
      ggtitle("Not So Best Film Winning Other Categories") +
      xlab("Categories") +
      ylab("Frequency(%)")

Conclusion: Based on the chart above, if BEST PICTURE nominee did not win, it has 27.27% chance of atleast winnging CINEMATOGRAPHY.

Lets plot both together and then compare.

g_3 <- ggplot() + geom_line(data = category_freq_2, aes(x = Category, y = percent, color = "Best Picture Non-Winner", group =1))
g_3 + geom_line(data = category_freq, aes(x = Category, y = percent, color = "Best Picture Winner", group=1)) +
  ggtitle("Category vs Best Picture Winner and Non-Winner") +
  xlab('Category') +
  ylab('frequency(%)') +
  scale_colour_manual("", 
                      values = c("Best Picture Winner"="green", "Best Picture Non-Winner"="red")) 

Best Picture winner mostly won DIRECTING and Best Picture non-winners won CINEMATOGRAPHY.