Data Massaging

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)

setwd("C:/Users/Sekhar/Documents/Github/Project_3_607_My_Work")
awards_df <- read.csv("Awards_File.csv",stringsAsFactors=F)
awards_modified <- awards_df[,c(1,3,4,6)]
#head(awards_modified)

The awards_modified data frame has some movies, which are nominated for the same category more than once, and has also won for one of its nominations (see below example)

awards_modified[awards_modified$movie_id==1288,]
##      movie_id year category_id won
## 2720     1288 1984           1   0
## 2721     1288 1984           1   1
## 2722     1288 1984           6   1
## 2723     1288 1984           7   0
## 2724     1288 1984           8   1
## 2725     1288 1984           9   1
## 2726     1288 1984          12   0
## 2727     1288 1984          13   1
## 2728     1288 1984          16   1
## 2729     1288 1984          19   1
## 2730     1288 1984          22   1

Display-1: Data showing the presence of multiple nominations for the same movie in the same category

To fix the above problem, we have to group by movie_id, year, category_id, and max(won) such grouping of data will include just the winning nomination, if a film is nominated multiple times in the same category, and one of the nominations wins in that category.

awards_modified <- awards_modified %>%
  group_by(movie_id,year,category_id) %>%
  summarise(won=max(won))

The above transformation has eliminated all the duplicate rows, where the same movie has been nominated in the same category more than once and one of them wins. It just includes the winning nomination. If none of the multiple nominations win, then only one of the nominations is included.

The below command confirms that for the movie_id=1288, only the winning nomination is inluded

awards_modified[awards_modified$movie_id==1288,]
## Source: local data frame [10 x 4]
## Groups: movie_id, year
## 
##    movie_id year category_id won
## 1      1288 1984           1   1
## 2      1288 1984           6   1
## 3      1288 1984           7   0
## 4      1288 1984           8   1
## 5      1288 1984           9   1
## 6      1288 1984          12   0
## 7      1288 1984          13   1
## 8      1288 1984          16   1
## 9      1288 1984          19   1
## 10     1288 1984          22   1

Display-2: Data showing the elimination of multiple nominations from the same movie, in the same award category

Now the spread() function can be applied to awards_modified data frame.

awards_re_modified <- spread((awards_modified),category_id,won)

Applying the spread() function to obtain the data for final analysis

The above command will produce another data frame called awards_re_modified, with the following columns: movie_id - Movie IDentifier Year - Year of award 1 to 23 - Award categories from 1 to 23 Wherever a movie wins an award, the respective category will have 1, wherever a movie is nominated, then the corresponding category will have 0, and wherever the movie is neither nominated nor wins, we will have NA

head(awards_re_modified)
## Source: local data frame [6 x 25]
## 
##   movie_id year  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
## 1        1 2010  0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 2        2 2010  0 NA NA  0 NA  0  0  0  0 NA NA NA NA NA NA  0 NA NA  0
## 3        3 2010  0 NA NA NA NA NA  0 NA  0 NA NA  1 NA  1 NA  0 NA NA  0
## 4        4 2010  1  0 NA  0 NA  0  0  0  1 NA NA  0 NA  0 NA  1 NA NA  0
## 5        5 2010  0 NA NA NA NA NA NA NA NA NA NA  0 NA  0  0  0 NA NA NA
## 6        6 2010 NA  1 NA  1 NA NA NA NA  0 NA NA  0 NA NA NA  0 NA NA NA
## Variables not shown: 20 (int), 21 (int), 22 (int), 23 (int)

Display-3: Sample data from the final data frame, which will be used for analysis

To get the category names (since the above dataframe contains the categories as numbers/codes), let us create a separate data frame award_categories:

award_categories <- unique(data.frame(category_id=awards_df$category_id,category_name=awards_df$category_name))
rownames(award_categories) <- NULL

award_categories
##    category_id               category_name
## 1            1       ACTOR -- LEADING ROLE
## 2            4  ACTRESS -- SUPPORTING ROLE
## 3            6               ART DIRECTION
## 4            7              CINEMATOGRAPHY
## 5            8              COSTUME DESIGN
## 6            9                   DIRECTING
## 7           16                BEST PICTURE
## 8           19                       SOUND
## 9           20               SOUND EDITING
## 10          22                     WRITING
## 11          12                FILM EDITING
## 12          14             MUSIC (SCORING)
## 13           2    ACTOR -- SUPPORTING ROLE
## 14          15                MUSIC (SONG)
## 15           3     ACTRESS -- LEADING ROLE
## 16           5       ANIMATED FEATURE FILM
## 17          21              VISUAL EFFECTS
## 18          10       DOCUMENTARY (FEATURE)
## 19          11 DOCUMENTARY (SHORT SUBJECT)
## 20          13                      MAKEUP
## 21          17       SHORT FILM (ANIMATED)
## 22          18    SHORT FILM (LIVE ACTION)
## 23          23         DOCUMENTARY (OTHER)

Display-4: award_categories data frame

Since the variables are not allowed to start with numeric values, let us rename the column names of the data frame awards_re_modified to character variables. For example, the variable name “1” represents the award category_ID 1. We will change this to “c1”“, to represent Award category-1.

names(awards_re_modified) <- c("movie_id", "year",  "c1",  "c2",  "c3",  "c4",  "c5", "c6",  "c7",  "c8",  "c9", "c10", "c11", "c12",
                               "c13", "c14", "c15","c16", "c17", "c18", "c19", "c20", "c21", "c22", "c23")

So finally we have the following data frames to work/use for our analysis: award_categories awards_re_modified

Main Objectives of the analysis

The Best Picture category_id is 16 and Film Editing category_id is 12. We have to identify which categories can help us to predict if a film gets the best picture, and determine if film_editing has the maximum probability.

Our specific objectives are:

Data Analysis process

Let us use conditional probability to determine the following:

Objective-1 Probability of getting Best Movie, given that the movie is “nominated”" in a paricular award category?

Objective-2 Probability of getting Best Movie, given that the movie “wins”" in a paricular award category?

Bayes formula to find conditional probability:

The formula for conditional probability is given below: \[P(A|B) = P(B|A) . P(A)/P(B) = P(AB) / P(B)\]

In other words …. we will use the following formulas to fulfill the objectives of our analysis.

For Objective-1:

P(Best Movie | It is nominated in x category) = P(that the movie won the Best picture award and also nominated in x category) / P(that a movie is nominated in x category)

For Objective-2:

P(Best Movie | (It wins the award in category x)) = P(that the movie won the Best picture award and also wins award in x category) / P(that a movie wins the award in x category)

Let us create a temp datafame, and read all the awards information, dropping the movie_id and year. We will also move the c16 (Best picture) award category as the last variable in temp data frame. This will help us to iterate via the columns easily, using for loop construct of R:

temp <- data.frame()
temp <- data.frame(awards_re_modified[,3:25])
temp$c0 <- temp$c16
temp$c16 <- NULL
names(temp)[23] <- "c16"

#probability_if_nominated <- data.frame(norow=22,ncol=2)

Let us display some sample rows of temp:

head(temp)
##   c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c17 c18 c19 c20 c21
## 1  0 NA NA NA NA NA NA NA NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
## 2  0 NA NA  0 NA  0  0  0  0  NA  NA  NA  NA  NA  NA  NA  NA   0   0  NA
## 3  0 NA NA NA NA NA  0 NA  0  NA  NA   1  NA   1  NA  NA  NA   0  NA  NA
## 4  1  0 NA  0 NA  0  0  0  1  NA  NA   0  NA   0  NA  NA  NA   0  NA  NA
## 5  0 NA NA NA NA NA NA NA NA  NA  NA   0  NA   0   0  NA  NA  NA  NA  NA
## 6 NA  1 NA  1 NA NA NA NA  0  NA  NA   0  NA  NA  NA  NA  NA  NA  NA  NA
##   c22 c23 c16
## 1  NA  NA  NA
## 2   0  NA   0
## 3   1  NA   0
## 4   1  NA   1
## 5   0  NA   0
## 6   0  NA   0

The R code to get the conditional probabilities of getting the best picture, given that the movie is nominated in a specific category is given below. This code produces a data frame, with the following details:

*Award_Category_ID

*Award_Category_Name

*Probability_Percent

The third variable (Probability_Percent) represents the probability of getting best picture award if the movie is nominated in a specific award category.

R Code for Objective-1

p <- vector(length=22)
category_id <- vector(length=22)
category_name <- vector(length=22)
category_name <- NULL
for(i in 1:22)
{

  x <- temp[,c(i,23)]
  #Get the probability of getting a movie nominated in category i
  #First get the number of movies, and then get the number of movies for which the category_i is 1 or 0
  
  category_nominated_count <- sum(!is.na(x[,1])) #if category_i is 0 or 1, then the movie is nominated
  Exhaustive_movies_count <- nrow(x)
  #Hence the probability of getting a movie nominated is ...
  p_of_getting_nom_in_i <- (category_nominated_count / Exhaustive_movies_count)
  
  
  #Now get the count of movies which won the best movie given that they were nominated in category_i
  p_of_best_pic_and_get_nom_in_i <- (sum(!is.na((x[which(x[,2] == 1),])[1])) / nrow(x))
      
  #Now, the probability of getting best movie given that it is nominated in category_i
   p[i] <- (p_of_best_pic_and_get_nom_in_i / p_of_getting_nom_in_i) 
  category_id[i] <- as.numeric(substr(names(temp)[i],2,3))
  category_name[i] <- as.vector(award_categories[award_categories$category_id==(category_id[i]),2])
  
  
}

prob_of_best_pic_given_cat_nom <- data.frame(Award_Category_ID=category_id,Award_Category_Name=category_name,Probability_Percent=(p*100))

prob_of_best_pic_given_cat_nom 
##    Award_Category_ID         Award_Category_Name Probability_Percent
## 1                  1       ACTOR -- LEADING ROLE           13.417722
## 2                  2    ACTOR -- SUPPORTING ROLE           12.676056
## 3                  3     ACTRESS -- LEADING ROLE            6.403941
## 4                  4  ACTRESS -- SUPPORTING ROLE            8.211144
## 5                  5       ANIMATED FEATURE FILM            0.000000
## 6                  6               ART DIRECTION            7.818182
## 7                  7              CINEMATOGRAPHY            9.265734
## 8                  8              COSTUME DESIGN            8.333333
## 9                  9                   DIRECTING           19.070905
## 10                10       DOCUMENTARY (FEATURE)            0.000000
## 11                11 DOCUMENTARY (SHORT SUBJECT)            0.000000
## 12                12                FILM EDITING           17.662338
## 13                13                      MAKEUP           11.627907
## 14                14             MUSIC (SCORING)            6.477733
## 15                15                MUSIC (SONG)            2.035623
## 16                17       SHORT FILM (ANIMATED)            0.000000
## 17                18    SHORT FILM (LIVE ACTION)            0.000000
## 18                19                       SOUND            9.734513
## 19                20               SOUND EDITING            6.730769
## 20                21              VISUAL EFFECTS            4.273504
## 21                22                     WRITING            8.735632
## 22                23         DOCUMENTARY (OTHER)            0.000000
barplot(prob_of_best_pic_given_cat_nom$Probability_Percent,names.arg=prob_of_best_pic_given_cat_nom$Award_Category_ID, 
        xlab = "Category ID",ylab="Probability Percentage",col=1:23,
        main='Probability(%) of getting Best Picture, given the nomination category',
        ylim=c(0, 25) 
          )

The above display of the probabilities conclude that if a film is nominated in “Directing” category, then there is 19% probability of winning the Best Picture award. The “Directing” category is the best predictor for “Best Picture” award. The second best predictor is “Film Editing”, which predicts that if a film is nominated for “Film Editing” category, then there is 17.7% probability that it wins “Best Picture” award.

Let us compute the probabilities of getting “Best Picture” award to a movie, given that the movie wins award in a specific category (Objective_2)

The R code to get the conditional probabilities of getting the best picture, given that the movie wins in a specific category is given below. This code produces a data frame, with the following details:

  • Award_Category_ID

  • Award_Category_Name

  • Probability_Percent

The third variable (Probability_Percent) represents the probability of getting best picture award if the movie is wins in a specific award category.

R Code for Objective-2

for(i in 1:22)
{
  
  x <- temp[,c(i,23)]
  #Get the number of times a movie won in cat_i
  num_of_wins_in_cat_i <- sum(x[which(!is.na(x[,1])),1] == 1)
  
  #Get the number of times a movie got Best movie and also won cat_i
  num_of_best_movies_given_cat_i_win <- sum(x[which(x[,2] == 1),1] == 1,na.rm=T)
  
  #Probability of winning Best pic, given a picture wins in cat_i
  p[i] <- (num_of_best_movies_given_cat_i_win / num_of_wins_in_cat_i)
  
  category_id[i] <- as.numeric(substr(names(temp)[i],2,3))
  category_name[i] <- as.vector(award_categories[award_categories$category_id==(category_id[i]),2])
  
}


prob_of_best_pic_given_a_cat_wins <- data.frame(Award_Category_ID=category_id,Award_Category_Name=category_name,Probability_Percent=(p*100))
prob_of_best_pic_given_a_cat_wins
##    Award_Category_ID         Award_Category_Name Probability_Percent
## 1                  1       ACTOR -- LEADING ROLE           30.588235
## 2                  2    ACTOR -- SUPPORTING ROLE           20.000000
## 3                  3     ACTRESS -- LEADING ROLE           13.095238
## 4                  4  ACTRESS -- SUPPORTING ROLE           16.000000
## 5                  5       ANIMATED FEATURE FILM            0.000000
## 6                  6               ART DIRECTION           25.000000
## 7                  7              CINEMATOGRAPHY           21.929825
## 8                  8              COSTUME DESIGN           23.750000
## 9                  9                   DIRECTING           71.764706
## 10                10       DOCUMENTARY (FEATURE)            0.000000
## 11                11 DOCUMENTARY (SHORT SUBJECT)            0.000000
## 12                12                FILM EDITING           42.857143
## 13                13                      MAKEUP           12.903226
## 14                14             MUSIC (SCORING)           19.200000
## 15                15                MUSIC (SONG)            6.493506
## 16                17       SHORT FILM (ANIMATED)            0.000000
## 17                18    SHORT FILM (LIVE ACTION)            0.000000
## 18                19                       SOUND           29.113924
## 19                20               SOUND EDITING            7.894737
## 20                21              VISUAL EFFECTS            8.333333
## 21                22                     WRITING           30.508475
## 22                23         DOCUMENTARY (OTHER)            0.000000
barplot(prob_of_best_pic_given_a_cat_wins$Probability_Percent,names.arg=prob_of_best_pic_given_a_cat_wins$Award_Category_ID,
        xlab = "Category ID",ylab="Probability Percentage",col=1:23,
        main='Probability(%) of getting Best Picture, given award winner',
        ylim=c(0, 100) 
          )

The above display concludes that, if a movie wins the “Directing” award, then there is 71.8% probability that it also wins “Best Picture”, and if a film wins “Film Editing” award, then there is 42.8% chance that it can also win “Best Film”. Hence “Directing” category predicts the best picture with a maximum probability.

probs <- rbind(prob_of_best_pic_given_cat_nom$Probability_Percent, prob_of_best_pic_given_a_cat_wins$Probability_Percent)
 barplot(probs, main="Probabilities comparision (Nominated Vs. Winner)",
         names.arg=prob_of_best_pic_given_a_cat_wins$Award_Category_ID,
   xlab="Category ID", col=c("darkblue","green")
,beside=TRUE,
       legend = (c("P(Best Pic | Nominated in x)",
                  "P(Best Pic | Wins in x)")))

Analysis summary

The following conclusions can be drawn from this analysis:

  • The “Driecting” award predicts the “Best Movie” award with maximum probability. If a film is “nominated” in the direction category, then there is a 19% chance that it gets “best picture” award. But if a movie “wins” the award in direction category, then there is 71.7% chance that it also wins best picture.

  • The second best predictor for “Best Film” is the “Film Editing” category. If a movie is “nominated” in editing category, then there is 17.67% that it gets best picture award. But if a movie wins “Film Editing” award, then there is 42.8% chance that it also wins “Best picture” award.

  • From the Probabilities comparison graph display above, we can infer that the probability of getting best picture award increases drastically if the movie “wins”" a specific category (when compared to the probability of getting best picture if the movie is “nominated”" in the same category).

  • If a film is nominated in Some categories (such as Animated feature films, documentaries, short films), then there is zero percent chance that it gets best picture award.

  • The probability that a movie gets best picture, if the movie wins “Actor Leading role” is much higher than the probability if the movie wins “Actress leading role”.

  • In fact “Actress supporting role” category has more influence on “Best picture” than the “Actress leading role” category. If a movie wins “Actress supporting role” award, then there is 16% chance that it also wins “Best picture”, while if a movie wins “Actress Leading role”, then there is only 13% chance that it also wins “Best picture”

Future work