In this project we will explore whether Film Editing is the best predictor of Best Picture. Chris and Jason Joseph obtained, parsed and transformed text file from awards database to csv file. This data will be used to do exploratory analysis below.
Is Film Editing the best predictor of Best Picture?
require(tidyr)
require(dplyr)
require(knitr)
require(stringr)
require(ggplot2)
Load csv File
# Load Master File
academy_awards <- read.csv(file="https://raw.githubusercontent.com/mkds/IS607_Project3/gh-pages/Data/project_view_1981_2014.csv", stringsAsFactors = FALSE)
kable(head(academy_awards))
| Year | Category | Nominee | Won |
|---|---|---|---|
| 1981 | ACTOR IN A LEADING ROLE | Reds | no |
| 1981 | ACTOR IN A LEADING ROLE | On Golden Pond | yes |
| 1981 | ACTOR IN A LEADING ROLE | Atlantic City | no |
| 1981 | ACTOR IN A LEADING ROLE | Arthur | no |
| 1981 | ACTOR IN A LEADING ROLE | Absence of Malice | no |
| 1981 | ACTOR IN A SUPPORTING ROLE | Only When I Laugh | no |
I will separate Best Picture into its own data frame and then join with other like categories below.
# Separate Best Picture won into different data frame
best_picture <- academy_awards %>%
select(Year, Category, Nominee, Won) %>%
filter(Won == "yes",
Category %in% c("BEST PICTURE", "BEST MOTION PICTURE")) %>%
arrange(Year)
kable(head(best_picture))
| Year | Category | Nominee | Won |
|---|---|---|---|
| 1981 | BEST PICTURE | Chariots of Fire | yes |
| 1982 | BEST PICTURE | Gandhi | yes |
| 1983 | BEST PICTURE | Terms of Endearment | yes |
| 1984 | BEST PICTURE | Amadeus | yes |
| 1985 | BEST PICTURE | Out of Africa | yes |
| 1986 | BEST PICTURE | Platoon | yes |
Create another data frame that contains all Categories except for Best Picture
# Select Categories and other columns into different data frame
#separate(Category,c("Category","CategoryType"), sep=" IN A ") %>%
categories_won <- academy_awards %>% select(Year, Category, Nominee, Won) %>%
filter(Won == "yes",
!(Category %in% c("BEST PICTURE"))) %>%
arrange(Year)
kable(head(categories_won))
| Year | Category | Nominee | Won |
|---|---|---|---|
| 1981 | ACTOR IN A LEADING ROLE | On Golden Pond | yes |
| 1981 | ACTOR IN A SUPPORTING ROLE | Arthur | yes |
| 1981 | ACTRESS IN A LEADING ROLE | On Golden Pond | yes |
| 1981 | ACTRESS IN A SUPPORTING ROLE | Reds | yes |
| 1981 | CINEMATOGRAPHY | Reds | yes |
| 1981 | COSTUME DESIGN | Chariots of Fire | yes |
Now we have 2 data frames, Best Picture and Others. Let us join them based on Nominees
# Join the data frames based on the Best Pictures won
category_picture_joined <- categories_won %>% inner_join(best_picture, by = "Nominee")
colnames(category_picture_joined)[2] <- "Category"
kable(head(category_picture_joined))
| Year.x | Category | Nominee | Won.x | Year.y | Category.y | Won.y |
|---|---|---|---|---|---|---|
| 1981 | COSTUME DESIGN | Chariots of Fire | yes | 1981 | BEST PICTURE | yes |
| 1982 | CINEMATOGRAPHY | Gandhi | yes | 1982 | BEST PICTURE | yes |
| 1982 | COSTUME DESIGN | Gandhi | yes | 1982 | BEST PICTURE | yes |
| 1982 | DIRECTING | Gandhi | yes | 1982 | BEST PICTURE | yes |
| 1982 | FILM EDITING | Gandhi | yes | 1982 | BEST PICTURE | yes |
| 1983 | DIRECTING | Terms of Endearment | yes | 1983 | BEST PICTURE | yes |
Now find frequencies of Categories that also won Best Picture
# Find the frequence and percent of each category
category_freq <- category_picture_joined %>%
group_by(Category) %>% summarise(count = n()) %>%
mutate(percent=round(count/sum(count)*100,digits=2)) %>% arrange(desc(percent))
kable(category_freq)
| Category | count | percent |
|---|---|---|
| DIRECTING | 25 | 37.31 |
| FILM EDITING | 16 | 23.88 |
| CINEMATOGRAPHY | 11 | 16.42 |
| COSTUME DESIGN | 11 | 16.42 |
| SOUND MIXING | 3 | 4.48 |
| SOUND EDITING | 1 | 1.49 |
Plot to visually see which Category relates to Best Picture.
# Plot it
g <- ggplot(category_freq, aes(x=Category, y=percent, fill=Category))
g + geom_bar(stat = "identity") +
coord_flip() +
ggtitle("Best Film Winning Other Categories") +
xlab("Categories") +
ylab("Frequency(%)")
Is there a coorelation between winning Categories and Best Picture Nominees that did not win?
Let’s start by separating non-winning Best Picture Nominees.
not_so_best_picture <- academy_awards %>%
select(Year, Category, Nominee, Won) %>%
filter(Won == "no",
Category %in% c("BEST PICTURE")) %>%
arrange(Year)
kable(head(not_so_best_picture))
| Year | Category | Nominee | Won |
|---|---|---|---|
| 1981 | BEST PICTURE | Atlantic City | no |
| 1981 | BEST PICTURE | On Golden Pond | no |
| 1981 | BEST PICTURE | Raiders of the Lost Ark | no |
| 1981 | BEST PICTURE | Reds | no |
| 1982 | BEST PICTURE | E.T. The Extra-Terrestrial | no |
| 1982 | BEST PICTURE | Missing | no |
Join with categories that it won
category_picture_joined_2 <- categories_won %>% inner_join(not_so_best_picture, by = "Nominee")
colnames(category_picture_joined_2)[2] <- "Category"
kable(head(category_picture_joined_2))
| Year.x | Category | Nominee | Won.x | Year.y | Category.y | Won.y |
|---|---|---|---|---|---|---|
| 1981 | CINEMATOGRAPHY | Reds | yes | 1981 | BEST PICTURE | no |
| 1981 | DIRECTING | Reds | yes | 1981 | BEST PICTURE | no |
| 1981 | FILM EDITING | Raiders of the Lost Ark | yes | 1981 | BEST PICTURE | no |
| 1983 | FILM EDITING | The Right Stuff | yes | 1983 | BEST PICTURE | no |
| 1984 | CINEMATOGRAPHY | The Killing Fields | yes | 1984 | BEST PICTURE | no |
| 1984 | FILM EDITING | The Killing Fields | yes | 1984 | BEST PICTURE | no |
category_freq_2 <- category_picture_joined_2 %>%
group_by(Category) %>% summarise(count = n()) %>%
mutate(percent=round(count/sum(count)*100,digits=2)) %>% arrange(desc(percent))
kable(category_freq_2)
| Category | count | percent |
|---|---|---|
| CINEMATOGRAPHY | 15 | 27.27 |
| FILM EDITING | 12 | 21.82 |
| DIRECTING | 8 | 14.55 |
| SOUND EDITING | 8 | 14.55 |
| COSTUME DESIGN | 6 | 10.91 |
| SOUND MIXING | 6 | 10.91 |
g_2 <- ggplot(category_freq_2, aes(x=Category, y=percent, fill=Category))
g_2 + geom_bar(stat = "identity") +
coord_flip() +
scale_fill_brewer() +
ggtitle("Not So Best Film Winning Other Categories") +
xlab("Categories") +
ylab("Frequency(%)")
Lets plot both together and then compare.
g_3 <- ggplot() + geom_line(data = category_freq_2, aes(x = Category, y = percent, color = "Best Picture Non-Winner", group =1))
g_3 + geom_line(data = category_freq, aes(x = Category, y = percent, color = "Best Picture Winner", group=1)) +
ggtitle("Category vs Best Picture Winner and Non-Winner") +
xlab('Category') +
ylab('frequency(%)') +
scale_colour_manual("",
values = c("Best Picture Winner"="green", "Best Picture Non-Winner"="red"))