Oscar Winners and Race Correlation

Mikayla Edwards

11/22/2020

Hypothesis

I found two datasets with Oscars data. The first is called Oscars nominated movies 2000-2017 (https://www.kaggle.com/vipulgote4/oscars-nominated-movies-from-2000-to-2017) and the second is a demographics dataset of Academy Award winners (https://www.kaggle.com/fmejia21/demographics-of-academy-awards-oscars-winners). I hypothesize that movies with higher proportions of people of color are less likely to win an award in any of the 5 categories below:

Best Picture won
Best Director won
Oscars Best Actor Won
Oscars Best Actress Won
People Choice won

These categories were selected from the Oscars nominated movies dataset because they had both nomination and winner columns. They are also the most well-known categories and not as specific as some of the others, such as 'London Critics Circle Film.' Because of this, they give me the most data to work with for my analysis. To identify films with people of color, I used the race_simple column, which lists each person as either "White" or "POC" (person of color). I also used Excel to add a column called POC_percent. This column was created by filtering the data by movie and then dividing the number of POC entries by the total number of characters listed for that movie.
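The POC_percent calculation described above was done in Excel, but the same idea can be sketched in R. The data frame below is made up for illustration (toy_demographics and its rows are not from the real dataset); only the column names film, race_simple, and POC_percent come from the analysis. As in the description, the value is a proportion between 0 and 1.

```r
library(dplyr)

# Toy stand-in for the demographics data; these rows are invented.
toy_demographics <- data.frame(
  film        = c("Film A", "Film A", "Film A", "Film B", "Film B"),
  race_simple = c("White", "POC", "POC", "White", "White")
)

# For each film: number of POC rows divided by the total rows for that film.
# mean() of a TRUE/FALSE vector is exactly that proportion.
toy_demographics <- toy_demographics %>%
  group_by(film) %>%
  mutate(POC_percent = mean(race_simple == "POC")) %>%
  ungroup()

print(toy_demographics)
```

Doing this step in R rather than Excel would keep the whole analysis reproducible from the raw files.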

Merging the Datasets

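I noticed that the datasets share a common column (called movie in the first and film in the second), so I used an inner join to combine them. An inner join keeps only the movies that appear in both datasets.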

library(readxl)

# Load the nominated-movies data (path is relative to the working directory)
Oscar_nominated_Movies_2000_2017 <- read_excel("Desktop/Oscar nominated Movies 2000-2017.xlsx")
View(Oscar_nominated_Movies_2000_2017)

# Load the demographics data
demographics <- read_excel("Desktop/demographics.xlsx")
View(demographics)


library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Oscar_demographics <- inner_join(Oscar_nominated_Movies_2000_2017, demographics, by = c("movie" = "film"))
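Because the join type determines which movies survive the merge, it can help to compare the options on a small example. The sketch below uses two invented data frames (movies and people are hypothetical stand-ins, not the real data) to show how inner_join, left_join, and anti_join differ when the key columns are named movie and film.

```r
library(dplyr)

# Invented stand-ins for the two datasets.
movies <- data.frame(
  movie = c("Film A", "Film B", "Film C"),
  Oscar_Best_Picture_won = c("Yes", "No", "No")
)
people <- data.frame(
  film = c("Film A", "Film A", "Film D"),
  race_simple = c("White", "POC", "White")
)

# inner_join: only movies found in both tables (Film A, once per person).
inner <- inner_join(movies, people, by = c("movie" = "film"))

# left_join: every movie is kept; movies without a match get NA for race_simple.
left <- left_join(movies, people, by = c("movie" = "film"))

# anti_join: movies that have no match at all, useful for checking what is lost.
unmatched <- anti_join(movies, people, by = c("movie" = "film"))

nrow(inner)
nrow(left)
unmatched$movie
```

Running anti_join on the real data would show how many nominated movies drop out of the analysis because no demographics rows match their title.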

Cleaning the data

The merged data had a lot of columns I wasn't interested in so I deleted the excess columns.

Oscar_demographics <- Oscar_demographics[-c(1,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17)]

Oscar_demographics <- Oscar_demographics[-c(3,5,7,9,11)]

Oscar_demographics <- Oscar_demographics[-c(8,9,10,11,12,13,14,15,16,17)]

Oscar_demographics <- Oscar_demographics[-c(8:18)]

Oscar_demographics <- Oscar_demographics[-c(8:25)]

Oscar_demographics <- Oscar_demographics[-c(8:15)]

Oscar_demographics <- Oscar_demographics[-c(8:11)]

Oscar_demographics <- Oscar_demographics[-c(9:40)]

Oscar_demographics <- Oscar_demographics[-c(9:19)]

Oscar_demographics <- Oscar_demographics[-c(9)]
Oscar_demographics <- Oscar_demographics[-c(6,7)]
View(Oscar_demographics)
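Dropping columns by position works, but each step changes the positions used by the next one, so it is hard to re-check later. An alternative is to keep columns by name. The sketch below is only an illustration on an invented data frame: toy_merged, year, and duration are hypothetical, and the keep vector would need to list whichever columns the real analysis uses.

```r
library(dplyr)

# Invented stand-in for the merged data, with two columns that are not needed.
toy_merged <- data.frame(
  movie = c("Film A", "Film B"),
  year = c(2001, 2002),                       # hypothetical extra column
  Oscar_Best_Picture_won = c("Yes", "No"),
  duration = c(120, 95),                      # hypothetical extra column
  race_simple = c("POC", "White"),
  POC_percent = c(0.5, 0)
)

# Name the columns to keep; all_of() raises an error if any name is missing.
keep <- c("movie", "Oscar_Best_Picture_won", "race_simple", "POC_percent")
toy_clean <- toy_merged %>% select(all_of(keep))

names(toy_clean)
```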


Binary

Next, I changed the columns of interest from "Yes"/"No" and "White"/"POC" to binary values. For the award columns, Yes=1 and No=0. For race_simple, POC=1 and White=0.

Oscar_demographics$race_simple <- ifelse(Oscar_demographics$race_simple == "White", 0, 1)
Oscar_demographics$Oscar_Best_Picture_won <- ifelse(Oscar_demographics$Oscar_Best_Picture_won == "No", 0, 1)
Oscar_demographics$Oscar_Best_Actor_won <- ifelse(Oscar_demographics$Oscar_Best_Actor_won == "No", 0, 1)
Oscar_demographics$Oscar_Best_Actress_won <- ifelse(Oscar_demographics$Oscar_Best_Actress_won == "No", 0, 1)
Oscar_demographics$Oscar_Best_Director_won <- ifelse(Oscar_demographics$Oscar_Best_Director_won == "No", 0, 1)
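One thing to be aware of with ifelse(x == "No", 0, 1) is that every value other than "No" becomes 1, including unexpected entries such as a misspelling. The helper below is a hypothetical sketch (to_binary is not part of any package) that stops if it sees a value outside the two expected labels and keeps missing values as NA.

```r
# Hypothetical helper: recode a two-label column to 0/1 and fail on surprises.
to_binary <- function(x, one, zero) {
  # Only the two expected labels (or NA) are allowed.
  stopifnot(all(x %in% c(one, zero, NA)))
  # TRUE becomes 1, FALSE becomes 0, NA stays NA.
  as.integer(x == one)
}

# Invented example values.
won <- c("Yes", "No", NA, "Yes")
won_binary <- to_binary(won, one = "Yes", zero = "No")
print(won_binary)

race <- c("White", "POC", "POC")
race_binary <- to_binary(race, one = "POC", zero = "White")
print(race_binary)
```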

View(Oscar_demographics)


Correlation

Next, instead of running separate correlation analyses for each column with POC_percent and race_simple, I decided to run a correlation matrix to see them all at once.

library(GGally)

## Loading required package: ggplot2

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

ggcorr(Oscar_demographics, low = "steelblue", mid = "white", high = "brown")

## Warning in ggcorr(Oscar_demographics, low = "steelblue", mid = "white", : data
## in column(s) 'movie', 'race' are not numeric and were ignored

It appears that the strongest correlations for both POC_percent and race_simple are with the People_Choice_won category.
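The color scale gives a visual impression, but the actual coefficient and a significance test can be computed with base R. The sketch below uses invented numbers (the toy data frame is not the real data) and assumes a Pearson correlation, which I believe matches what ggcorr plots by default. With a 0/1 column, the Pearson coefficient is the same as the point-biserial correlation.

```r
# Invented values; column names follow the merged data.
toy <- data.frame(
  POC_percent       = c(0.0, 0.1, 0.2, 0.5, 0.6, 0.8),
  People_Choice_won = c(0, 0, 0, 1, 0, 1)
)

# Pearson correlation coefficient, ignoring rows with missing values.
r <- cor(toy$POC_percent, toy$People_Choice_won, use = "complete.obs")

# Same coefficient with a p-value and confidence interval.
test <- cor.test(toy$POC_percent, toy$People_Choice_won)

print(r)
print(test$p.value)
```

Adding label = TRUE to the ggcorr call should also print the coefficients directly on the matrix, which makes the strongest pairs easier to compare.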

Example

An example consistent with this correlation can be seen in the graph above, where "The Martian" had the highest POC_percent and won the People's Choice Award.

Results & Further Research

My hypothesis was not supported: movies with higher proportions of people of color were more likely to win one of the five awards, the People's Choice Award. This is interesting because it is the only one of the five categories that is decided by the general public and not the Academy. Further research with a larger dataset would be worthwhile, as the general public and the Academy may have different standards for what they consider a 'good film'. It would also be interesting to explore the high positive correlation between Best Actor and Best Director wins.

Analyzing Twitter data containing the words "people of color" or "diversity" together with "Oscars" would also be interesting. I attempted this; however, fewer than 10 tweets were returned, so it would need to be done through a premium API to get more than a week's worth of tweets.