This is the Metadata Guide for the IMDB Rating Differential Study datafile “2017-05-05 films.csv”. This dataset was created on 2017-05-05 at 10:46:24 by Emily Kothe.
This Metadata Guide provides a brief summary describing the data and its source, as well as the instructions and code required to regenerate the data. Data were collected to evaluate the premise that anime films are over-rated relative to non-anime films. Running this code will generate and save a new version of this datafile.
# required packages
library("rvest")
library("tidyverse")
library("stringr")
Download data for the anime dataset
# This search returns animated films with the anime keyword. Films have been
# released, are classified as feature or shorts and are sorted by number of votes
SearchURL <- read_html("http://www.imdb.com/search/title?count=100&genres=animation&keywords=anime&production_status=released&title_type=feature,short&sort=num_votes,desc")
Title <- SearchURL %>% html_nodes(".lister-item-header a") %>% html_text()
Rating <- SearchURL %>% html_nodes(".ratings-imdb-rating strong") %>% html_text()
Rating <- as.numeric(Rating)
RatingsBar <- SearchURL %>% html_nodes(".ratings-bar") %>% html_text()
Votes <- SearchURL %>% html_nodes(".sort-num_votes-visible span:nth-child(2)") %>%
  html_text()
Votes <- as.numeric(gsub(",", "", Votes))
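The selectors above can be tried without contacting IMDB. The following is a minimal sketch that applies them to a hypothetical HTML fragment; the fragment is an assumption written to mimic one search result, and the live IMDB markup may differ or may have changed since this guide was written.

```r
library("rvest")

# Hypothetical fragment mimicking a single search result (not real IMDB output)
fragment <- read_html('<div class="lister-item">
  <h3 class="lister-item-header"><a href="#">Example Film</a></h3>
  <p class="sort-num_votes-visible">
    <span>Votes:</span>
    <span>1,234,567</span>
  </p>
</div>')

# Same selectors as in the download step
Title <- fragment %>% html_nodes(".lister-item-header a") %>% html_text()
Votes <- fragment %>% html_nodes(".sort-num_votes-visible span:nth-child(2)") %>%
  html_text()

# Strip the thousands separators before converting to a number
Votes <- as.numeric(gsub(",", "", Votes))
```

Here `Title` holds the link text of the header and `Votes` holds the vote count as a number rather than as a comma-formatted string.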
Create the anime dataframe
## This combines the search results in multiple steps
# 1. Clean up the ratings bar so only the Metascore value remains
# 2. Combine Title, Rating, Votes, and Metascore
# Step 1 Clean up the ratings bar so only the Metascore value remains
RatingsBar_Clean <- str_replace_all(RatingsBar, "[\\s]", "")
RatingsBar_Clean_2 <- str_extract(RatingsBar_Clean, "X.*?Metascore")
RatingsBar_Clean_3 <- str_extract(RatingsBar_Clean_2, "\\(?[0-9,.]+\\)?")
Metascore <- as.numeric(RatingsBar_Clean_3)
# Step 2 Combine Title, Rating, Votes, and Metascore
anime <- cbind.data.frame(Title, Rating, Votes, Metascore)
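The three cleaning steps can be followed on a small worked example. This is a sketch only: the two input strings are hypothetical stand-ins for the text of a ratings bar (one with a Metascore, one without) and are not copied from IMDB.

```r
library("stringr")

# Hypothetical ratings bar text; the real text scraped from IMDB may differ
RatingsBar <- c(
  "8.6 Rate this 1 2 3 4 5 6 7 8 9 10 8.6/10 X 96 Metascore",
  "7.1 Rate this 1 2 3 4 5 6 7 8 9 10 7.1/10 X"
)

# Remove all whitespace
RatingsBar_Clean <- str_replace_all(RatingsBar, "[\\s]", "")

# Keep the shortest run from the first "X" up to "Metascore";
# entries without a Metascore become NA
RatingsBar_Clean_2 <- str_extract(RatingsBar_Clean, "X.*?Metascore")

# Pull the number out of that run
RatingsBar_Clean_3 <- str_extract(RatingsBar_Clean_2, "\\(?[0-9,.]+\\)?")
Metascore <- as.numeric(RatingsBar_Clean_3)
```

In this sketch the first entry yields a Metascore of 96 and the second yields `NA`, because `str_extract()` returns `NA` when the pattern is not found.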
Download data for the all films dataset
# This search returns films with no restrictions. Films have been released, are
# classified as feature or shorts and are sorted by number of votes
SearchURL <- read_html("http://www.imdb.com/search/title?count=100&production_status=released&title_type=feature,short&sort=num_votes,desc")
Title <- SearchURL %>% html_nodes(".lister-item-header a") %>% html_text()
Rating <- SearchURL %>% html_nodes(".ratings-imdb-rating strong") %>% html_text()
Rating <- as.numeric(Rating)
RatingsBar <- SearchURL %>% html_nodes(".ratings-bar") %>% html_text()
Votes <- SearchURL %>% html_nodes(".sort-num_votes-visible span:nth-child(2)") %>%
  html_text()
Votes <- as.numeric(gsub(",", "", Votes))
Create the all films dataframe
## This combines the search results in multiple steps
# 1. Clean up the ratings bar so only the Metascore value remains
# 2. Combine Title, Rating, Votes, and Metascore
# Step 1 Clean up the ratings bar so only the Metascore value remains
RatingsBar_Clean <- str_replace_all(RatingsBar, "[\\s]", "")
RatingsBar_Clean_2 <- str_extract(RatingsBar_Clean, "X.*?Metascore")
RatingsBar_Clean_3 <- str_extract(RatingsBar_Clean_2, "\\(?[0-9,.]+\\)?")
Metascore <- as.numeric(RatingsBar_Clean_3)
# Step 2 Combine Title, Rating, Votes, and Metascore
all.films <- cbind.data.frame(Title, Rating, Votes, Metascore)
anime$Genre <- "Anime"
all.films$Genre <- "All"
films <- rbind(anime, all.films)
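# Time is not defined earlier in this script; the current date is assumed here
# because it matches the "2017-05-05 films.csv" naming pattern
Time <- Sys.Date()
write.csv(films, file = paste(Time, "films.csv"))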
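The combine steps assume that Title, Rating, Votes, and Metascore all have one entry per film. Because `html_nodes()` returns only the nodes it finds, a search result lacking one of these elements could leave a vector shorter than the others. The sketch below shows one possible check to run before `cbind.data.frame()`; `check_equal_lengths()` is a hypothetical helper and is not part of the original script.

```r
# Hypothetical helper: TRUE only when every vector has the same length
check_equal_lengths <- function(...) {
  lens <- lengths(list(...))
  length(unique(lens)) == 1
}

# Toy vectors standing in for scraped results (not real data)
Title <- c("Film A", "Film B", "Film C")
Rating <- c(8.1, 7.4)        # one rating missing
Votes <- c(1000, 900, 800)

titles_match_votes <- check_equal_lengths(Title, Votes)
all_match <- check_equal_lengths(Title, Rating, Votes)
```

With these toy vectors `titles_match_votes` is `TRUE` and `all_match` is `FALSE`, which would flag the mismatch before the data frame is built.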
As per the registration (https://aspredicted.org/3sh38.pdf), the data relate to the 100 films with the most IMDB votes that are classified as “anime” and the 100 films with the most IMDB votes that are classified as not anime.
The following data are retained for each film: Title, Rating, Votes, Metascore, and Genre.
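The layout of the saved datafile can be illustrated with a toy example. This is a sketch under stated assumptions: the film names and values are invented, and the file is written to a temporary directory with a date-stamped name so that nothing is overwritten.

```r
# Invented rows standing in for the two scraped data frames (not real data)
anime <- data.frame(Title = c("Anime A", "Anime B"),
                    Rating = c(8.0, 7.5),
                    Votes = c(5000, 4000),
                    Metascore = c(80, NA))
all.films <- data.frame(Title = c("Film A", "Film B"),
                        Rating = c(9.0, 8.8),
                        Votes = c(90000, 85000),
                        Metascore = c(95, 90))

# Label each set, then stack them into one data frame
anime$Genre <- "Anime"
all.films$Genre <- "All"
films <- rbind(anime, all.films)

# Date-stamped file name, written to a temporary directory for this sketch
out_file <- file.path(tempdir(), paste(Sys.Date(), "films.csv"))
write.csv(films, file = out_file)
```

The resulting `films` data frame has one row per film and the five retained columns, with Genre distinguishing the two searches.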