Data Massaging
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
setwd("C:/Users/Sekhar/Documents/Github/Project_3_607_My_Work")
awards_df <- read.csv("Awards_File.csv",stringsAsFactors=F)
awards_modified <- awards_df[,c(1,3,4,6)]
#head(awards_modified)
The awards_modified data frame has some movies, which are nominated for the same category more than once, and has also won for one of its nominations (see below example)
awards_modified[awards_modified$movie_id==1288,]
## movie_id year category_id won
## 2720 1288 1984 1 0
## 2721 1288 1984 1 1
## 2722 1288 1984 6 1
## 2723 1288 1984 7 0
## 2724 1288 1984 8 1
## 2725 1288 1984 9 1
## 2726 1288 1984 12 0
## 2727 1288 1984 13 1
## 2728 1288 1984 16 1
## 2729 1288 1984 19 1
## 2730 1288 1984 22 1
Display-1: Data showing the presence of multiple nominations for the same movie in the same category
To fix the above problem, we have to group by movie_id, year, category_id, and max(won) such grouping of data will include just the winning nomination, if a film is nominated multiple times in the same category, and one of the nominations wins in that category.
awards_modified <- awards_modified %>%
group_by(movie_id,year,category_id) %>%
summarise(won=max(won))
The above transformation has eliminated all the duplicate rows, where the same movie has been nominated in the same category more than once and one of them wins. It just includes the winning nomination. If none of the multiple nominations win, then only one of the nominations is included.
The below command confirms that for the movie_id=1288, only the winning nomination is inluded
awards_modified[awards_modified$movie_id==1288,]
## Source: local data frame [10 x 4]
## Groups: movie_id, year
##
## movie_id year category_id won
## 1 1288 1984 1 1
## 2 1288 1984 6 1
## 3 1288 1984 7 0
## 4 1288 1984 8 1
## 5 1288 1984 9 1
## 6 1288 1984 12 0
## 7 1288 1984 13 1
## 8 1288 1984 16 1
## 9 1288 1984 19 1
## 10 1288 1984 22 1
Display-2: Data showing the elimination of multiple nominations from the same movie, in the same award category
Now the spread() function can be applied to awards_modified data frame.
awards_re_modified <- spread((awards_modified),category_id,won)
Applying the spread() function to obtain the data for final analysis
The above command will produce another data frame called awards_re_modified, with the following columns: movie_id - Movie IDentifier Year - Year of award 1 to 23 - Award categories from 1 to 23 Wherever a movie wins an award, the respective category will have 1, wherever a movie is nominated, then the corresponding category will have 0, and wherever the movie is neither nominated nor wins, we will have NA
head(awards_re_modified)
## Source: local data frame [6 x 25]
##
## movie_id year 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
## 1 1 2010 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 2 2 2010 0 NA NA 0 NA 0 0 0 0 NA NA NA NA NA NA 0 NA NA 0
## 3 3 2010 0 NA NA NA NA NA 0 NA 0 NA NA 1 NA 1 NA 0 NA NA 0
## 4 4 2010 1 0 NA 0 NA 0 0 0 1 NA NA 0 NA 0 NA 1 NA NA 0
## 5 5 2010 0 NA NA NA NA NA NA NA NA NA NA 0 NA 0 0 0 NA NA NA
## 6 6 2010 NA 1 NA 1 NA NA NA NA 0 NA NA 0 NA NA NA 0 NA NA NA
## Variables not shown: 20 (int), 21 (int), 22 (int), 23 (int)
Display-3: Sample data from the final data frame, which will be used for analysis
To get the category names (since the above dataframe contains the categories as numbers/codes), let us create a separate data frame award_categories:
award_categories <- unique(data.frame(category_id=awards_df$category_id,category_name=awards_df$category_name))
rownames(award_categories) <- NULL
award_categories
## category_id category_name
## 1 1 ACTOR -- LEADING ROLE
## 2 4 ACTRESS -- SUPPORTING ROLE
## 3 6 ART DIRECTION
## 4 7 CINEMATOGRAPHY
## 5 8 COSTUME DESIGN
## 6 9 DIRECTING
## 7 16 BEST PICTURE
## 8 19 SOUND
## 9 20 SOUND EDITING
## 10 22 WRITING
## 11 12 FILM EDITING
## 12 14 MUSIC (SCORING)
## 13 2 ACTOR -- SUPPORTING ROLE
## 14 15 MUSIC (SONG)
## 15 3 ACTRESS -- LEADING ROLE
## 16 5 ANIMATED FEATURE FILM
## 17 21 VISUAL EFFECTS
## 18 10 DOCUMENTARY (FEATURE)
## 19 11 DOCUMENTARY (SHORT SUBJECT)
## 20 13 MAKEUP
## 21 17 SHORT FILM (ANIMATED)
## 22 18 SHORT FILM (LIVE ACTION)
## 23 23 DOCUMENTARY (OTHER)
Display-4: award_categories data frame
Since the variables are not allowed to start with numeric values, let us rename the column names of the data frame awards_re_modified to character variables. For example, the variable name “1” represents the award category_ID 1. We will change this to “c1”“, to represent Award category-1.
names(awards_re_modified) <- c("movie_id", "year", "c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10", "c11", "c12",
"c13", "c14", "c15","c16", "c17", "c18", "c19", "c20", "c21", "c22", "c23")
So finally we have the following data frames to work/use for our analysis: award_categories awards_re_modified
Main Objectives of the analysis
The Best Picture category_id is 16 and Film Editing category_id is 12. We have to identify which categories can help us to predict if a film gets the best picture, and determine if film_editing has the maximum probability.
Our specific objectives are:
- Probability of getting Best Movie, given that the movie is nominated in a paricular award category?
- Probability of getting Best Movie, given that the movie wins in a paricular award category?
Data Analysis process
Let us use conditional probability to determine the following:
Objective-1 Probability of getting Best Movie, given that the movie is “nominated”" in a paricular award category?
Objective-2 Probability of getting Best Movie, given that the movie “wins”" in a paricular award category?
R Code for Objective-1
p <- vector(length=22)
category_id <- vector(length=22)
category_name <- vector(length=22)
category_name <- NULL
for(i in 1:22)
{
x <- temp[,c(i,23)]
#Get the probability of getting a movie nominated in category i
#First get the number of movies, and then get the number of movies for which the category_i is 1 or 0
category_nominated_count <- sum(!is.na(x[,1])) #if category_i is 0 or 1, then the movie is nominated
Exhaustive_movies_count <- nrow(x)
#Hence the probability of getting a movie nominated is ...
p_of_getting_nom_in_i <- (category_nominated_count / Exhaustive_movies_count)
#Now get the count of movies which won the best movie given that they were nominated in category_i
p_of_best_pic_and_get_nom_in_i <- (sum(!is.na((x[which(x[,2] == 1),])[1])) / nrow(x))
#Now, the probability of getting best movie given that it is nominated in category_i
p[i] <- (p_of_best_pic_and_get_nom_in_i / p_of_getting_nom_in_i)
category_id[i] <- as.numeric(substr(names(temp)[i],2,3))
category_name[i] <- as.vector(award_categories[award_categories$category_id==(category_id[i]),2])
}
prob_of_best_pic_given_cat_nom <- data.frame(Award_Category_ID=category_id,Award_Category_Name=category_name,Probability_Percent=(p*100))
prob_of_best_pic_given_cat_nom
## Award_Category_ID Award_Category_Name Probability_Percent
## 1 1 ACTOR -- LEADING ROLE 13.417722
## 2 2 ACTOR -- SUPPORTING ROLE 12.676056
## 3 3 ACTRESS -- LEADING ROLE 6.403941
## 4 4 ACTRESS -- SUPPORTING ROLE 8.211144
## 5 5 ANIMATED FEATURE FILM 0.000000
## 6 6 ART DIRECTION 7.818182
## 7 7 CINEMATOGRAPHY 9.265734
## 8 8 COSTUME DESIGN 8.333333
## 9 9 DIRECTING 19.070905
## 10 10 DOCUMENTARY (FEATURE) 0.000000
## 11 11 DOCUMENTARY (SHORT SUBJECT) 0.000000
## 12 12 FILM EDITING 17.662338
## 13 13 MAKEUP 11.627907
## 14 14 MUSIC (SCORING) 6.477733
## 15 15 MUSIC (SONG) 2.035623
## 16 17 SHORT FILM (ANIMATED) 0.000000
## 17 18 SHORT FILM (LIVE ACTION) 0.000000
## 18 19 SOUND 9.734513
## 19 20 SOUND EDITING 6.730769
## 20 21 VISUAL EFFECTS 4.273504
## 21 22 WRITING 8.735632
## 22 23 DOCUMENTARY (OTHER) 0.000000
barplot(prob_of_best_pic_given_cat_nom$Probability_Percent,names.arg=prob_of_best_pic_given_cat_nom$Award_Category_ID,
xlab = "Category ID",ylab="Probability Percentage",col=1:23,
main='Probability(%) of getting Best Picture, given the nomination category',
ylim=c(0, 25)
)

The above display of the probabilities conclude that if a film is nominated in “Directing” category, then there is 19% probability of winning the Best Picture award. The “Directing” category is the best predictor for “Best Picture” award. The second best predictor is “Film Editing”, which predicts that if a film is nominated for “Film Editing” category, then there is 17.7% probability that it wins “Best Picture” award.
Let us compute the probabilities of getting “Best Picture” award to a movie, given that the movie wins award in a specific category (Objective_2)
The R code to get the conditional probabilities of getting the best picture, given that the movie wins in a specific category is given below. This code produces a data frame, with the following details:
Award_Category_ID
Award_Category_Name
Probability_Percent
The third variable (Probability_Percent) represents the probability of getting best picture award if the movie is wins in a specific award category.
R Code for Objective-2
for(i in 1:22)
{
x <- temp[,c(i,23)]
#Get the number of times a movie won in cat_i
num_of_wins_in_cat_i <- sum(x[which(!is.na(x[,1])),1] == 1)
#Get the number of times a movie got Best movie and also won cat_i
num_of_best_movies_given_cat_i_win <- sum(x[which(x[,2] == 1),1] == 1,na.rm=T)
#Probability of winning Best pic, given a picture wins in cat_i
p[i] <- (num_of_best_movies_given_cat_i_win / num_of_wins_in_cat_i)
category_id[i] <- as.numeric(substr(names(temp)[i],2,3))
category_name[i] <- as.vector(award_categories[award_categories$category_id==(category_id[i]),2])
}
prob_of_best_pic_given_a_cat_wins <- data.frame(Award_Category_ID=category_id,Award_Category_Name=category_name,Probability_Percent=(p*100))
prob_of_best_pic_given_a_cat_wins
## Award_Category_ID Award_Category_Name Probability_Percent
## 1 1 ACTOR -- LEADING ROLE 30.588235
## 2 2 ACTOR -- SUPPORTING ROLE 20.000000
## 3 3 ACTRESS -- LEADING ROLE 13.095238
## 4 4 ACTRESS -- SUPPORTING ROLE 16.000000
## 5 5 ANIMATED FEATURE FILM 0.000000
## 6 6 ART DIRECTION 25.000000
## 7 7 CINEMATOGRAPHY 21.929825
## 8 8 COSTUME DESIGN 23.750000
## 9 9 DIRECTING 71.764706
## 10 10 DOCUMENTARY (FEATURE) 0.000000
## 11 11 DOCUMENTARY (SHORT SUBJECT) 0.000000
## 12 12 FILM EDITING 42.857143
## 13 13 MAKEUP 12.903226
## 14 14 MUSIC (SCORING) 19.200000
## 15 15 MUSIC (SONG) 6.493506
## 16 17 SHORT FILM (ANIMATED) 0.000000
## 17 18 SHORT FILM (LIVE ACTION) 0.000000
## 18 19 SOUND 29.113924
## 19 20 SOUND EDITING 7.894737
## 20 21 VISUAL EFFECTS 8.333333
## 21 22 WRITING 30.508475
## 22 23 DOCUMENTARY (OTHER) 0.000000
barplot(prob_of_best_pic_given_a_cat_wins$Probability_Percent,names.arg=prob_of_best_pic_given_a_cat_wins$Award_Category_ID,
xlab = "Category ID",ylab="Probability Percentage",col=1:23,
main='Probability(%) of getting Best Picture, given award winner',
ylim=c(0, 100)
)

The above display concludes that, if a movie wins the “Directing” award, then there is 71.8% probability that it also wins “Best Picture”, and if a film wins “Film Editing” award, then there is 42.8% chance that it can also win “Best Film”. Hence “Directing” category predicts the best picture with a maximum probability.
probs <- rbind(prob_of_best_pic_given_cat_nom$Probability_Percent, prob_of_best_pic_given_a_cat_wins$Probability_Percent)
barplot(probs, main="Probabilities comparision (Nominated Vs. Winner)",
names.arg=prob_of_best_pic_given_a_cat_wins$Award_Category_ID,
xlab="Category ID", col=c("darkblue","green")
,beside=TRUE,
legend = (c("P(Best Pic | Nominated in x)",
"P(Best Pic | Wins in x)")))
