607 Project 3 - Envelope Please -Analysis

In this project we will explore whether Film Editing is the best predictor of Best Picture. Chris and Jason Joseph obtained, parsed and transformed text file from awards database to csv file. This data will be used to do exploratory analysis below.

Is Film Editing the best predictor of Best Picture?

require(tidyr)
require(dplyr)
require(knitr)
require(stringr)
require(ggplot2)

Load csv File

# Load Master File
academy_awards <- read.csv(file="https://raw.githubusercontent.com/mkds/IS607_Project3/gh-pages/Data/project_view_1981_2014.csv", stringsAsFactors = FALSE)
kable(head(academy_awards))

Year	Category	Nominee	Won
1981	ACTOR IN A LEADING ROLE	Reds	no
1981	ACTOR IN A LEADING ROLE	On Golden Pond	yes
1981	ACTOR IN A LEADING ROLE	Atlantic City	no
1981	ACTOR IN A LEADING ROLE	Arthur	no
1981	ACTOR IN A LEADING ROLE	Absence of Malice	no
1981	ACTOR IN A SUPPORTING ROLE	Only When I Laugh	no

I will separate Best Picture into its own data frame and then join with other like categories below.

# Separate Best Picture won into different data frame
best_picture <- academy_awards %>% 
                select(Year, Category, Nominee, Won) %>%
                filter(Won == "yes", 
                       Category %in% c("BEST PICTURE", "BEST MOTION PICTURE")) %>%
                arrange(Year)
kable(head(best_picture))

Year	Category	Nominee	Won
1981	BEST PICTURE	Chariots of Fire	yes
1982	BEST PICTURE	Gandhi	yes
1983	BEST PICTURE	Terms of Endearment	yes
1984	BEST PICTURE	Amadeus	yes
1985	BEST PICTURE	Out of Africa	yes
1986	BEST PICTURE	Platoon	yes

Create another data frame that contains all Categories except for Best Picture

# Select Categories and other columns into different data frame
#separate(Category,c("Category","CategoryType"), sep=" IN A ") %>%
categories_won <- academy_awards %>% select(Year, Category, Nominee, Won) %>%
                  filter(Won == "yes", 
                         !(Category %in% c("BEST PICTURE"))) %>%
                  arrange(Year)
kable(head(categories_won))

Year	Category	Nominee	Won
1981	ACTOR IN A LEADING ROLE	On Golden Pond	yes
1981	ACTOR IN A SUPPORTING ROLE	Arthur	yes
1981	ACTRESS IN A LEADING ROLE	On Golden Pond	yes
1981	ACTRESS IN A SUPPORTING ROLE	Reds	yes
1981	CINEMATOGRAPHY	Reds	yes
1981	COSTUME DESIGN	Chariots of Fire	yes

Now we have 2 data frames, Best Picture and Others. Let us join them based on Nominees

# Join the data frames based on the Best Pictures won
category_picture_joined <- categories_won %>% inner_join(best_picture, by = "Nominee")
colnames(category_picture_joined)[2] <- "Category"
kable(head(category_picture_joined))

Year.x	Category	Nominee	Won.x	Year.y	Category.y	Won.y
1981	COSTUME DESIGN	Chariots of Fire	yes	1981	BEST PICTURE	yes
1982	CINEMATOGRAPHY	Gandhi	yes	1982	BEST PICTURE	yes
1982	COSTUME DESIGN	Gandhi	yes	1982	BEST PICTURE	yes
1982	DIRECTING	Gandhi	yes	1982	BEST PICTURE	yes
1982	FILM EDITING	Gandhi	yes	1982	BEST PICTURE	yes
1983	DIRECTING	Terms of Endearment	yes	1983	BEST PICTURE	yes

Now find frequencies of Categories that also won Best Picture

# Find the frequence and percent of each category 
category_freq <- category_picture_joined %>%  
                 group_by(Category) %>% summarise(count = n()) %>%
                 mutate(percent=round(count/sum(count)*100,digits=2)) %>% arrange(desc(percent))
kable(category_freq)

Category	count	percent
DIRECTING	25	37.31
FILM EDITING	16	23.88
CINEMATOGRAPHY	11	16.42
COSTUME DESIGN	11	16.42
SOUND MIXING	3	4.48
SOUND EDITING	1	1.49

Plot to visually see which Category relates to Best Picture.

# Plot it
g <- ggplot(category_freq, aes(x=Category, y=percent, fill=Category)) 
g + geom_bar(stat = "identity") + 
    coord_flip() + 
    ggtitle("Best Film Winning Other Categories") +
    xlab("Categories") +
    ylab("Frequency(%)")

Conclusion: Surprisingly, Best Picture also won DIRECTING category 25 times(37.31%) and FILM EDITING only 16 times! This shows that Film Editing is not the best predictor of Best Picture.

Is there a coorelation between winning Categories and Best Picture Nominees that did not win?

Let’s start by separating non-winning Best Picture Nominees.

not_so_best_picture <- academy_awards %>% 
                select(Year, Category, Nominee, Won) %>%
                filter(Won == "no", 
                       Category %in% c("BEST PICTURE")) %>%
                arrange(Year)
kable(head(not_so_best_picture))

Year	Category	Nominee	Won
1981	BEST PICTURE	Atlantic City	no
1981	BEST PICTURE	On Golden Pond	no
1981	BEST PICTURE	Raiders of the Lost Ark	no
1981	BEST PICTURE	Reds	no
1982	BEST PICTURE	E.T. The Extra-Terrestrial	no
1982	BEST PICTURE	Missing	no

Join with categories that it won

category_picture_joined_2 <- categories_won %>% inner_join(not_so_best_picture, by = "Nominee")
colnames(category_picture_joined_2)[2] <- "Category"
kable(head(category_picture_joined_2))

Year.x	Category	Nominee	Won.x	Year.y	Category.y	Won.y
1981	CINEMATOGRAPHY	Reds	yes	1981	BEST PICTURE	no
1981	DIRECTING	Reds	yes	1981	BEST PICTURE	no
1981	FILM EDITING	Raiders of the Lost Ark	yes	1981	BEST PICTURE	no
1983	FILM EDITING	The Right Stuff	yes	1983	BEST PICTURE	no
1984	CINEMATOGRAPHY	The Killing Fields	yes	1984	BEST PICTURE	no
1984	FILM EDITING	The Killing Fields	yes	1984	BEST PICTURE	no

category_freq_2 <- category_picture_joined_2 %>%  
                 group_by(Category) %>% summarise(count = n()) %>%
                 mutate(percent=round(count/sum(count)*100,digits=2)) %>% arrange(desc(percent))
kable(category_freq_2)

Category	count	percent
CINEMATOGRAPHY	15	27.27
FILM EDITING	12	21.82
DIRECTING	8	14.55
SOUND EDITING	8	14.55
COSTUME DESIGN	6	10.91
SOUND MIXING	6	10.91

g_2 <- ggplot(category_freq_2, aes(x=Category, y=percent, fill=Category)) 
g_2 + geom_bar(stat = "identity") + 
      coord_flip() + 
      scale_fill_brewer() +
      ggtitle("Not So Best Film Winning Other Categories") +
      xlab("Categories") +
      ylab("Frequency(%)")

Conclusion: Based on the chart above, if BEST PICTURE nominee did not win, it has 27.27% chance of atleast winnging CINEMATOGRAPHY.

Lets plot both together and then compare.

g_3 <- ggplot() + geom_line(data = category_freq_2, aes(x = Category, y = percent, color = "Best Picture Non-Winner", group =1))
g_3 + geom_line(data = category_freq, aes(x = Category, y = percent, color = "Best Picture Winner", group=1)) +
  ggtitle("Category vs Best Picture Winner and Non-Winner") +
  xlab('Category') +
  ylab('frequency(%)') +
  scale_colour_manual("", 
                      values = c("Best Picture Winner"="green", "Best Picture Non-Winner"="red"))

607 Project 3 - Envelope Please -Analysis

IS 607.1 Team

October 18, 2015

Conclusion: Surprisingly, Best Picture also won DIRECTING category 25 times(37.31%) and FILM EDITING only 16 times! This shows that Film Editing is not the best predictor of Best Picture.

Conclusion: Based on the chart above, if BEST PICTURE nominee did not win, it has 27.27% chance of atleast winnging CINEMATOGRAPHY.

Best Picture winner mostly won DIRECTING and Best Picture non-winners won CINEMATOGRAPHY.