Homework 6 : Samhith Barlaya

Submission for Homework 6

Samhith Barlaya
2022-01-20

The following are my research questions and their respective plots:

library(dplyr)
library(ggplot2)

movie_data <- read.csv("C:/Users/gbsam/Desktop/movie_metadata.csv")

country_summary <- movie_data %>% 
  group_by(country) %>%   
  summarise(median_rating = median(imdb_score),  
            sd_rating = sd(imdb_score), 
            n_ = n()) %>% top_n(10, median_rating) 

barPlot <- ggplot(country_summary, aes(reorder(country, -median_rating), median_rating)) + 
                   geom_col() +  
                   geom_errorbar(aes(ymin = median_rating - sd_rating, ymax = median_rating + sd_rating), width=0.2)

barPlot + labs(y="Movie IMDB Rating, with uncertainty", x = "Country")

Many of the countries listed here have very few movies in the data set (sometimes just one), which skews the result. Instead, I choose the 10 countries with the most movies and redraw the plot above, which gives a more realistic result.

country_summary <- movie_data %>% 
  group_by(country) %>%   
  summarise(median_rating = median(imdb_score),  
            sd_rating = sd(imdb_score),
            n_ = n()) %>% top_n(10, n_) 

barPlot <- ggplot(country_summary, aes(reorder(country, -median_rating), median_rating)) + 
                   geom_col() +  
                   geom_errorbar(aes(ymin = median_rating - sd_rating, ymax = median_rating + sd_rating), width=0.2)

barPlot + labs(y="Movie IMDB Rating, with uncertainty", x = "Country")
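
As a rough sketch of an alternative to keeping only the ten largest countries, one could require a minimum number of movies per country and then rank by median rating. The data frame `toy_movies` and the threshold `min_movies` below are made-up assumptions for illustration, not part of the movie data set.

```r
library(dplyr)

# Toy data standing in for movie_metadata.csv (values are made up)
toy_movies <- data.frame(
  country    = c("A", "A", "A", "B", "B", "C"),
  imdb_score = c(6.0, 7.0, 8.0, 5.0, 7.0, 9.5)
)

min_movies <- 2  # hypothetical threshold, chosen only for this example

country_filtered <- toy_movies %>%
  group_by(country) %>%
  summarise(median_rating = median(imdb_score),
            sd_rating = sd(imdb_score),  # sd() is NA when a country has a single movie
            n_ = n()) %>%
  filter(n_ >= min_movies) %>%           # drop countries with too few movies
  arrange(desc(median_rating))

print(country_filtered)
```

Country "C" has the highest score but only one movie, so it is dropped before ranking and no error bar is left undefined.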

To answer my research question “Does the duration of a movie impact its popularity?”, I plot a smoothed trend line of movie duration against IMDB rating.

ggplot(data=movie_data, aes(x=duration, y=imdb_score, group=1)) + geom_smooth()

There are too many movies pooled together here to draw any conclusions. Hence, I split the movies by language.

ggplot(subset(movie_data, language %in% c('English', 'Cantonese', 'French', 'German',
                                          'Japanese', 'Italian', 'Mandarin', 'Spanish')),
       aes(x=duration, y=imdb_score, group=1)) +
  geom_smooth() +
  facet_wrap(vars(language))
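
The smoothed curves above can be complemented with a single number per language. The sketch below computes a Spearman rank correlation between duration and rating for each language. It runs on a small made-up data frame (`toy_movies`), so the values say nothing about the real data, and choosing Spearman rather than Pearson is an assumption made here.

```r
library(dplyr)

# Toy data standing in for movie_metadata.csv (values are made up)
toy_movies <- data.frame(
  language   = c("English", "English", "English", "English",
                 "French", "French", "French", "French"),
  duration   = c(90, 100, 120, 150, 85, 95, 110, NA),
  imdb_score = c(5.5, 6.0, 6.8, 7.5, 7.0, 6.5, 6.0, 7.2)
)

duration_cor <- toy_movies %>%
  filter(!is.na(duration), !is.na(imdb_score)) %>%  # drop incomplete rows first
  group_by(language) %>%
  summarise(n_ = n(),
            spearman_rho = cor(duration, imdb_score, method = "spearman"))

print(duration_cor)
```

A value near 1 or -1 indicates a consistent monotone relationship, while a value near 0 suggests duration tells us little about rating for that language.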

I also look at the share of each content rating among the movies released each year. I restrict the x-axis to 1990 onwards to highlight the most relevant part of the graph.

ggplot(data = movie_data, aes(x = title_year, fill = content_rating)) +
  geom_bar() +
  xlim(c(1990, NA)) +
  ggtitle("Plot of number of movies per year and content rating share per year") +
  xlab("Years") + ylab("Number of movies")
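
The stacked bars above show counts, so the share of each content rating has to be read off by eye. A minimal sketch of computing the share explicitly is given below. It uses a made-up data frame (`toy_movies`), and the column names `n_movies` and `share` are introduced here only for illustration.

```r
library(dplyr)
library(ggplot2)

# Toy data standing in for movie_metadata.csv (values are made up)
toy_movies <- data.frame(
  title_year     = c(2000, 2000, 2000, 2000, 2001, 2001),
  content_rating = c("PG-13", "PG-13", "R", "R", "R", "PG")
)

rating_share <- toy_movies %>%
  count(title_year, content_rating, name = "n_movies") %>%  # movies per year and rating
  group_by(title_year) %>%
  mutate(share = n_movies / sum(n_movies)) %>%               # fraction within each year
  ungroup()

# Each bar now has height 1, so the segments show shares directly
share_plot <- ggplot(rating_share, aes(x = title_year, y = share, fill = content_rating)) +
  geom_col() +
  labs(x = "Years", y = "Share of movies", fill = "Content rating")

print(rating_share)
```

Printing `share_plot` draws the chart. Because every year sums to 1, years with few movies become directly comparable with years that have many.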

We see an interesting similarity in trend between the first and second plots: the IMDB rating varies in a similar way as either Actor 1’s or Actor 2’s Facebook likes increase.

ggplot(data=movie_data, aes(x=actor_1_facebook_likes, y=imdb_score, group=1)) + geom_smooth()
ggplot(data=movie_data, aes(x=actor_2_facebook_likes, y=imdb_score, group=1)) + geom_smooth()
ggplot(data=movie_data, aes(x=actor_3_facebook_likes, y=imdb_score, group=1)) + geom_smooth()
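
Comparing the three trends is easier when they share one set of axes. The sketch below stacks the three like columns into one long data frame and maps the source column to colour. It uses a made-up data frame (`toy_movies`), and the names `likes_long` and `billing` are introduced here only for illustration.

```r
library(ggplot2)

# Toy data standing in for movie_metadata.csv (values are made up)
toy_movies <- data.frame(
  imdb_score             = c(5.0, 6.0, 6.5, 7.0, 7.5, 8.0),
  actor_1_facebook_likes = c(100, 500, 1000, 5000, 10000, 20000),
  actor_2_facebook_likes = c(50, 200, 400, 900, 2000, 4000),
  actor_3_facebook_likes = c(10, 80, 150, 300, 600, 1000)
)

like_cols <- c("actor_1_facebook_likes",
               "actor_2_facebook_likes",
               "actor_3_facebook_likes")

# Stack the three columns: one row per (movie, actor slot) pair
likes_long <- do.call(rbind, lapply(like_cols, function(col) {
  data.frame(billing    = col,
             likes      = toy_movies[[col]],
             imdb_score = toy_movies$imdb_score)
}))

# One smoothed curve per actor slot on shared axes
likes_plot <- ggplot(likes_long, aes(x = likes, y = imdb_score, colour = billing)) +
  geom_smooth(se = FALSE)

str(likes_long)
```

Printing `likes_plot` draws the three curves together, which makes any similarity between Actor 1 and Actor 2 visible without switching between plots.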

Does the presence of an actor boost a movie’s ratings?

In order to answer this, I take the 15 actors listed most often as Actor 1, Actor 2 or Actor 3 and plot the median rating of their movies. We can see that some names, like Morgan Freeman, Steve Buscemi and Bruce Willis, appear on multiple plots, which seems to indicate that their presence in a movie has some effect on its rating.

actor_summary <- movie_data %>%
  group_by(actor_1_name) %>%
  summarise(median_rating = median(imdb_score),
            sd_rating = sd(imdb_score),
            n_ = n()) %>%
  top_n(15, n_)  # keep the 15 actors with the most movies

ggplot(actor_summary, aes(reorder(actor_1_name, -median_rating), median_rating)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90)) +
  geom_errorbar(aes(ymin = median_rating - sd_rating, ymax = median_rating + sd_rating), width = 0.2) +
  ggtitle("Plot of median IMDB rating for the 15 most frequent actors (Actor 1)") +
  xlab("Actors") + ylab("IMDB Median Rating")

actor_summary <- movie_data %>%
  group_by(actor_2_name) %>%
  summarise(median_rating = median(imdb_score),
            sd_rating = sd(imdb_score),
            n_ = n()) %>%
  top_n(15, n_)  # keep the 15 actors with the most movies

ggplot(actor_summary, aes(reorder(actor_2_name, -median_rating), median_rating)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90)) +
  geom_errorbar(aes(ymin = median_rating - sd_rating, ymax = median_rating + sd_rating), width = 0.2) +
  ggtitle("Plot of median IMDB rating for the 15 most frequent actors (Actor 2)") +
  xlab("Actors") + ylab("IMDB Median Rating")

actor_summary <- movie_data %>%
  group_by(actor_3_name) %>%
  summarise(median_rating = median(imdb_score),
            sd_rating = sd(imdb_score),
            n_ = n()) %>%
  top_n(15, n_)  # keep the 15 actors with the most movies

ggplot(actor_summary, aes(reorder(actor_3_name, -median_rating), median_rating)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90)) +
  geom_errorbar(aes(ymin = median_rating - sd_rating, ymax = median_rating + sd_rating), width = 0.2) +
  ggtitle("Plot of median IMDB rating for the 15 most frequent actors (Actor 3)") +
  xlab("Actors") + ylab("IMDB Median Rating")
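
The three blocks above differ only in the name column, so the summary step could be wrapped in a small helper. The sketch below is one possible way to do that. The function `summarise_by_actor`, its arguments, and the `toy_movies` data frame are hypothetical and not part of the original analysis.

```r
library(dplyr)

# Hypothetical helper: median and sd of imdb_score for the top_k most
# frequent names found in the column called name_col
summarise_by_actor <- function(data, name_col, top_k = 15) {
  data %>%
    group_by(actor = .data[[name_col]]) %>%  # look the column up by its name
    summarise(median_rating = median(imdb_score),
              sd_rating = sd(imdb_score),
              n_ = n()) %>%
    top_n(top_k, n_)                         # keep the most frequent names
}

# Toy data standing in for movie_metadata.csv (values are made up)
toy_movies <- data.frame(
  actor_1_name = c("X", "X", "X", "Y", "Y", "Z"),
  imdb_score   = c(6, 7, 8, 5, 7, 9)
)

top_actors <- summarise_by_actor(toy_movies, "actor_1_name", top_k = 2)
print(top_actors)
```

With the real data, the same helper could be called once per column, for example `summarise_by_actor(movie_data, "actor_2_name")`, and the result passed to the plotting code with `actor` on the x-axis.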

What is missing from your final project?

What do you hope to accomplish between now and submission time?