The Movies Dataset

Raw Data: 7 Excel files covering 45,000 movies, including cast, crew, keywords, and ratings.

Take the crew file for one movie (movie ID = 1) as an example: there are 107 crew members in a single cell. Each {…} is one crew member, and a movie can have 100+ of them. I show only 3 crew members here.

{'credit_id': '52fe4284c3a36847f8024f49', 'department': 'Directing', 'gender': 2, 'id': 7879, 'job': 'Director', 'name': 'John Lasseter', 'profile_path': '/7EdqiNbr4FRjIhKHyPPdFfEEEFG.jpg'},
{'credit_id': '52fe4284c3a36847f8024f4f', 'department': 'Writing', 'gender': 2, 'id': 12891, 'job': 'Screenplay', 'name': 'Joss Whedon', 'profile_path': '/dTiVsuaTVTeGmvkhcyJvKp2A5kr.jpg'},
{'credit_id': '52fe4284c3a36847f8024f55', 'department': 'Writing', 'gender': 2, 'id': 7, 'job': 'Screenplay', 'name': 'Andrew Stanton', 'profile_path': '/pvQWsu0qc8JFQhMVJkTHuexUAa1.jpg'}

Clean Data Sample Code:

# split cast
library(stringr)  # provides str_split()
# set working directory
setwd("C:/Users/14695/Desktop/mast6251/HW2")
# clean credits
credits <- read.csv("Data/credits.csv", stringsAsFactors = FALSE)
cast <- credits$cast
# a list to store result
base2 <- vector(mode = "list", length = nrow(credits))
# a loop to extract id and name
for (i in 1:nrow(credits)){
  # for each string
  string <- credits$cast[i]
  # split the string into "key: value" pieces
  # (values such as names may themselves contain numbers)
  sub <- str_split(string = string, pattern = ", ")[[1]]
  # each cast entry has 8 fields, in this order:
  # 1 is cast_id; 2 is character; 3 is credit_id; 4 is gender; 5 is id
  # 6 is name; 7 is order; 0 (the 8th) is profile_path
  cast_id <- sub[1:length(sub) %% 8 == 1]
  character <- sub[1:length(sub) %% 8 == 2]
  credit_id <- sub[1:length(sub) %% 8 == 3] ## not used below
  gender <- sub[1:length(sub) %% 8 == 4] ## not used below
  nameid <- sub[1:length(sub) %% 8 == 5]
  name <- sub[1:length(sub) %% 8 == 6]
  order <- sub[1:length(sub) %% 8 == 7]
  #########################################################
  # 1. extract numbers and split them into elements since numbers are unique to each id
  ## cast_id
  cast_id <- sapply(cast_id, function(x) regmatches(x, gregexpr("[[:digit:]]+", x))[[1]])
  matches <- regmatches(cast_id, gregexpr("[[:digit:]]+", cast_id))
  cast_id <- as.numeric(unlist(matches))
  ## name_id
  nameid <- sapply(nameid, function(x) regmatches(x, gregexpr("[[:digit:]]+", x))[[1]])
  matches <- regmatches(nameid, gregexpr("[[:digit:]]+", nameid))
  nameid <- as.numeric(unlist(matches))
  ## order
  order <- sapply(order, function(x) regmatches(x, gregexpr("[[:digit:]]+", x))[[1]])
  matches <- regmatches(order, gregexpr("[[:digit:]]+", order))
  order <- as.numeric(unlist(matches))
  # 2. extract name
  # I want to remove ', ", :, [], {}, and ,
  pattern <- c("'", '"', ":", "\\[|\\]", "\\{|\\}", ",")
  # a loop to replace those patterns sequentially
  ## character
  x <- character 
  for (j in pattern){
    # gsub() removes the pattern
    x <- sapply(x, function(y) gsub(pattern = j, replacement = "", x = y))
  }
  # further split
  # split on the "name " key and keep the first piece
  substr <- str_split(string = x, pattern = "name ")
  character <- sapply(substr, function(x) x[1])
  # drop the leading "character " key so only the role remains
  character <- substring(character, (nchar("character")+2))
  ## name
  x <- name 
  for (j in pattern){
    # gsub() removes the pattern
    x <- sapply(x, function(y) gsub(pattern = j, replacement = "", x = y))
  }
  # further split
  # the cleaned string looks like "name <actor name>"; split on the "name " key
  substr <- str_split(string = x, pattern = "name ")
  # the first piece is the empty "", so keep the second piece (the actual name)
  name <- sapply(substr, function(x) x[2])

  ######### combine
  id <- rep(credits$id[i], length(name))
  output2 <- as.matrix(cbind(as.character(id),character,name,order))
  # output
  base2[[i]] <- output2
}

####### combine all movies
# drop entries with fewer than 3 columns
base3 <- lapply(base2, function(x) if (ncol(x) < 3) NULL else x)
# keep only the matrices with exactly 4 columns, so that I can row bind them
keep <- which(sapply(base3, function(x) !is.null(x) && ncol(x) == 4))
base4 <- base3[keep]
idname2 <- do.call(what = "rbind", args = base4) ## matrix
idname3 <- as.data.frame(idname2)
# drop rows that contain any NA
idname4 <- as.matrix(idname3[!rowSums(is.na(idname3)) > 0, ])
# a matrix needs colnames(), not names(), to label its columns
colnames(idname4) <- c("id", "cast character", "cast name", "cast order")
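
As an alternative to splitting on ", " and counting positions, the fields of a crew cell can also be pulled out by key with a regular expression. The following is only a minimal sketch in base R, not the method used above: `extract_field()` is a helper name introduced here for illustration, and `crew_cell` is a shortened copy of the example crew cell. It assumes values are wrapped in single quotes and that every entry carries both keys, so entries that break either assumption (for example, a name quoted differently because it contains an apostrophe) would be skipped or misaligned.

```r
# Shortened sample in the same shape as the crew example above
crew_cell <- paste0(
  "[{'credit_id': '52fe4284c3a36847f8024f49', 'department': 'Directing', ",
  "'gender': 2, 'id': 7879, 'job': 'Director', 'name': 'John Lasseter', ",
  "'profile_path': '/7EdqiNbr4FRjIhKHyPPdFfEEEFG.jpg'}, ",
  "{'credit_id': '52fe4284c3a36847f8024f4f', 'department': 'Writing', ",
  "'gender': 2, 'id': 12891, 'job': 'Screenplay', 'name': 'Joss Whedon', ",
  "'profile_path': '/dTiVsuaTVTeGmvkhcyJvKp2A5kr.jpg'}]"
)

# extract_field() is a hypothetical helper, not part of the original code
extract_field <- function(cell, field) {
  # match  'field': 'value'  and capture the value between the quotes
  pattern <- paste0("'", field, "': '([^']*)'")
  hits <- regmatches(cell, gregexpr(pattern, cell, perl = TRUE))[[1]]
  sub(pattern, "\\1", hits, perl = TRUE)
}

crew <- data.frame(
  name = extract_field(crew_cell, "name"),
  job  = extract_field(crew_cell, "job"),
  stringsAsFactors = FALSE
)

# keep only the directors
directors <- crew$name[crew$job == "Director"]
print(crew)
print(directors)
```

Matching by key avoids relying on every entry having exactly 8 comma-separated pieces, which can break when a value itself contains ", ".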

Introduction and Clean Data

I analyzed a dataset of 45,000 movies released through 2017, along with related information such as vote count, cast, keywords, budget, and production companies. The goal is to clean the dataset, remove meaningless variables, and build a logistic regression model that identifies the factors affecting the final quality of a movie, in other words, whether it is good or bad. For data cleaning, I split the string data in the credits file to get the top 30 cast members and directors, which helped identify the effect they might have on movie rating. IMDB ratings were obtained from the official website and matched with our own dataset on IMDB ID. The final rating was calculated accordingly.
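
The matching on IMDB ID described above can be sketched with base R's `merge()`. This is a minimal illustration with made-up rows: the IDs, titles, and values are invented, and the column name `averageRating` is an assumption (only `numVotes` appears in this write-up).

```r
# Made-up movie metadata keyed by IMDB ID
movies <- data.frame(
  imdb_id = c("tt0000001", "tt0000002", "tt0000003"),
  title   = c("Movie A", "Movie B", "Movie C"),
  stringsAsFactors = FALSE
)

# Made-up external ratings; Movie B has no rating on purpose
imdb <- data.frame(
  imdb_id       = c("tt0000001", "tt0000003"),
  averageRating = c(7.2, 5.9),   # assumed column name
  numVotes      = c(1500, 320),
  stringsAsFactors = FALSE
)

# all.x = TRUE makes this a left join: every movie is kept,
# and movies without a match get NA in the rating columns
joined <- merge(movies, imdb, by = "imdb_id", all.x = TRUE)
print(joined)
```

A left join keeps unmatched movies visible as NA rows, so they can be counted and dropped deliberately rather than silently lost.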

The Data

First, the median rating is set as the threshold for quality evaluation. Second, each recorded rating was left-joined to the movie ID and other related information. Third, I focused on variables known before a movie's release, such as cast, crew, genre, budget, and runtime, rather than revenue and votes, which are only available after release.
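
The median threshold can be sketched as follows. The ratings are made up, and treating a rating strictly above the median as "good" is an assumption; the text does not say how ties at the median are handled.

```r
# Made-up final ratings for six movies
movies <- data.frame(
  id         = 1:6,
  final_rate = c(5.1, 6.3, 7.4, 6.8, 4.9, 8.0)
)

# The median rating is the cut-off between good and bad
threshold <- median(movies$final_rate)

# 1 = good (strictly above the median, an assumption), 0 = bad
movies$good <- as.integer(movies$final_rate > threshold)
print(threshold)
print(movies)
```

Using the median splits the movies into two groups of roughly equal size, which keeps the two classes balanced for the logistic regression.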

The two graphs show the mean rating of good and bad movies over the past 100 years. In the first graph, the average rating of good movies is around 7.1, with some extremely high-rated movies during 1925-1965; after that period the ratings become stable. In the second graph, there is a slightly increasing trend before 1960, after which the average rating of bad movies stabilizes around 5.5. Both graphs show a tipping point in 1965. Before 1965, average ratings fluctuate strongly. After 1965, the gap between good and bad movie ratings becomes stable.

The following table contains correlations between final_rate (y) and the five numerical variables. The table leaves out some variables, because correlation applies only to numeric variables and not to the 0/1 variables (see Logistic Regression on Page 4). Note that numVotes is extracted from the external IMDB website, and movies with zero budget and revenue were excluded. Interestingly, although there is no obvious correlation between Y (final rating) and the variables in the chart below, revenue is highly correlated with budget and numVotes. This makes sense, since a high budget usually allows heavy marketing, which in turn affects final revenue.

The following plots show revenue by genre. Most of the good blockbuster movies fall under the animation, mystery, crime, and music genres. Most bad movies are westerns, adventures, and documentaries; drama and war movies received bad ratings despite high revenue. Comparing both graphs, I find that the highest revenue achieved by a good movie is almost 250 million dollars more than that of a bad movie, which indicates that although bad movies can perform well in revenue, people are still willing to spend more on good movies.

I created a wordcloud of tags to see whether good and bad movies differ in their keywords. Unfortunately, the top 9 most frequent keywords are distributed quite similarly for good and bad movies.
Hence a combined wordcloud was drawn. Independent film (1424) ranks first by count, followed by woman director (964), murder (740), and sex (453). Although I could not distinguish good from bad movies by keyword tags, the wordcloud gives a straightforward view of frequency and helps film producers identify popular themes when investing in a new movie.
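
The keyword counts behind a wordcloud come down to a frequency table. Here is a minimal sketch with made-up tag lists; `tags_by_movie` is a hypothetical name, and the resulting counts are not the ones reported above.

```r
# Made-up keyword tags, one character vector per movie
tags_by_movie <- list(
  c("independent film", "murder"),
  c("woman director", "independent film"),
  c("murder", "independent film")
)

# Flatten all tags, count them, and sort from most to least frequent
tag_counts <- sort(table(unlist(tags_by_movie)), decreasing = TRUE)
print(tag_counts)

# The most frequent tag
top_tag <- names(tag_counts)[1]
print(top_tag)
```

The same table, computed separately for good and bad movies, is what lets the two keyword distributions be compared.
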
I selected the top 30 directors and cast members by the number of movies they took part in, to see whether some directors or cast members tend to produce good movies. In the director plot, Alfred Hitchcock has the highest number of good movies and Michael Curtiz the highest number of bad movies. Likewise, in the cast chart, John Wayne starred in the most good movies and Gene Hackman acted in the most bad movies. These two graphs could help film investors judge the likely quality of a movie more efficiently by looking up its cast and director. This matches the intuition that a good director or actor is likely to be a guarantee of a good movie.

The Model

To understand the relationship between the numeric variables and the dependent variable, I tried different numeric variables as independent variables to test which would generate the best-fitting model. When the 5 numerical variables (runtime, Nummovies, budget, numVotes, and revenue) were added to the model, the AIC is around 7388 and changes little. All numerical variables have insignificant P-values except runtime and the number of movies released in a given year. Runtime's coefficient is -0.003 in the multivariable model and 0.004 when runtime is the only independent variable. The number of movies released in a given year has a coefficient of -0.0007. These results suggest that, in this model, a longer runtime and a larger number of movies increase the probability of a bad movie.
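
The model comparison described above can be sketched with `glm()` and `AIC()`. The data below are simulated, so the fitted numbers will not match the ones reported; `budget_musd` (budget in millions of dollars) is a hypothetical column, and the simulation plants a runtime coefficient of 0.004 only to echo the runtime-only value from the text.

```r
set.seed(1)
n <- 500

# Simulated predictors (not the real dataset)
sim <- data.frame(
  runtime     = round(rnorm(n, mean = 100, sd = 20)),
  budget_musd = runif(n, min = 1, max = 100)
)

# Simulated good/bad label whose log-odds rise by 0.004 per minute of runtime
sim$good <- rbinom(n, size = 1, prob = plogis(-0.4 + 0.004 * sim$runtime))

# Logistic regression with one predictor and with two predictors
fit_runtime <- glm(good ~ runtime, data = sim, family = binomial)
fit_both    <- glm(good ~ runtime + budget_musd, data = sim, family = binomial)

# Lower AIC indicates a better trade-off between fit and model size
aic_table <- AIC(fit_runtime, fit_both)
print(aic_table)
print(summary(fit_both)$coefficients)

# Reading a coefficient: 0.004 per minute multiplies the odds of "good"
# by exp(0.004 * 30), about 1.13, for a movie that is 30 minutes longer
odds_ratio_30min <- exp(0.004 * 30)
print(odds_ratio_30min)
```

A logistic coefficient is a change in log-odds, so exponentiating it gives an odds ratio, which is easier to interpret than the raw 0.004.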

Then I create dummy variables for each categorical variable listed below in order to show how they affect whether a movie is good or not.

Genres: I include 20 genres, and only Documentary and Science Fiction have significant P-values. The coefficient of Documentary is -0.5, while that of Science Fiction is 0.28. This suggests that a Documentary is more likely to be a bad movie, while a Science Fiction movie is more likely to be considered good. This may be because audiences who watch documentaries have higher expectations than for other genres, which makes a high score harder to obtain. Science Fiction has been trending in recent years, for example the superhero movies by Marvel and DC. Young viewers enjoy fiction and fancy effects, which makes it easy for them to rate a science fiction movie as good.
Director: I select the top 400 directors, each with at least 8 movies. The results narrow down to 14 directors out of 400 with a P-value < 0.1. Interestingly, all coefficients are positive, meaning the more movies a director produced, the more good movies they contributed. So, if directors start with good movies, they can attract more funding and become more famous. The second step is to vary the number of directors. When I add the top 100 directors by number of movies, only 3 directors are significant; the top 300 yields 16 directors; but the top 500 yields only 2 directors whose P-value is < 0.1. So the number of directors entered has no clear relation to the number of directors that come out significant.
Cast: I select the top 100 cast names whose order is 0 or 1 in the cast file, which indicates leading roles, ranked by the number of movies they took part in. 14 have statistically significant P-values and all of their coefficients are negative, meaning a negative effect on the movies. The next step is to compare these 14 cast members (out of the top 100) with the list of the top 20 most prolific cast members; only two, Clint Eastwood and Gene Hackman, are on both lists. The main reason is probably that they worked together on two movies, Absolute Power in 1997 and Unforgiven in 1992, whose ratings are unfortunately not so high.
Tag: Next, I look at which tags have an actual effect on rating. The top 200 most frequent tags were used, and only desert and hero have negative effects on rating.
Year and Number of Movies Released in That Year: The result shows that not a single year is statistically significant, and when I use the number of movies released in the same year as input, the model has only a small estimated coefficient of -0.00007. So, the more movies released in a year, the higher the probability that a movie is rated bad, to some degree. Audiences may rate more critically when facing intense competition in the movie market.
Country: Only Belgium, Denmark, and South Korea have significant negative coefficients, which means movies produced by these three countries tend to be considered bad. To dive deeper, I plot movie scores for the top 20 countries below, together with the total number of movies each country produced. Interestingly, the 3 countries with the lowest lower quartile are Belgium, Denmark, and South Korea. Denmark and Belgium also have the 2 lowest median scores among the top 20 countries. So the model's finding that these three countries may produce more bad movies is consistent with the score distributions.
Budget and Runtime: Budget has a P-value of .6 in the logistic model on Good or Bad Movie, which is not statistically significant. Runtime, the length of the movie, has a significant estimated coefficient of 0.004. In other words, considering only the model itself, a longer movie indicates a higher probability of a good movie, which looks fishy. So I draw a new plot below to look at the relationship between runtime and budget for both good and bad movies. In conclusion, for movies longer than 175 minutes, the higher the budget, the higher the probability of a good movie. The relationship is not as strong for movies of other lengths.
Number of Categorical Variables: I count the number of tags, companies, languages, genres, and countries for each movie and use the counts in the logistic model. Only language has a significant P-value, with a coefficient of 0.098, meaning the more languages a movie is released in, the higher the probability of a good movie. However, this causal relationship remains suspicious, since normally only good movies are released to the global market in two or more languages.
Model Evaluation - Revenue and Votes: These two variables are post-release data, so they cannot be used to predict whether a movie is good or bad before its release. Revenues and votes from 2019, not 2017, were downloaded from IMDB.com. Movie popularity was added to the model to check our movie classification. The coefficients are all positive with significant P-values. Therefore it can safely be concluded that the higher the revenue or the number of votes on IMDB, the higher the probability of a good movie. I use only ratings and scores, without revenue or votes, when labeling movies as good or bad. So our dependent variable is acceptable.
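
The dummy-variable step used for the categorical factors above can be sketched as follows. A movie can carry several genres at once, so each genre gets its own 0/1 column. The genre lists are made up, and `genres_by_movie` is a hypothetical name.

```r
# Made-up genre lists, one character vector per movie
genres_by_movie <- list(
  c("Documentary"),
  c("Science Fiction", "Action"),
  c("Drama"),
  c("Documentary", "Drama")
)

# Every genre that appears at least once
all_genres <- sort(unique(unlist(genres_by_movie)))

# One 0/1 column per genre: 1 if the movie has that genre, else 0
dummies <- sapply(all_genres, function(g) {
  as.integer(sapply(genres_by_movie, function(x) g %in% x))
})
print(dummies)
```

The resulting matrix has one row per movie and one named column per genre, so it can be bound to the other predictors before fitting the logistic model. The same pattern applies to directors, cast, tags, and countries.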

The Insights

From the data analysis and regression model above, I can now tell whether a movie is likely to be good given some information. Our analysis of the movie dataset can be summarized as follows:

1: Movie quality has been stable and distinguishable in recent years. Unlike old movies, the ratings of good and bad movies nowadays tend to polarize and cluster. A possible interpretation is the maturity of the movie industry and the successful cultivation of movie-watching habits among audiences. Audiences with rich experience can appreciate and rate movies more sensibly. Moreover, movie production is now industrialized. Content quality depends on the ultimate goal of the production company, whether that is to target a niche market with a good story or to cater to the mainstream market with a blockbuster.

2: A good start is important for directors. As mentioned before, if a director has a highly rated movie at the beginning of their career, they are more likely to direct good movies afterwards than those who do not. This can be explained in two ways. First, a talented director would produce good movies regardless of whether the movie has a large budget. Second, for a new director, if the first movie has a low rating, it is hard to convince people to invest in later ideas, as they would expect it to be hard to get the expected return. Considering both situations, my suggestion is to try really hard on your first movie, since effort, unlike talent, is something a new director can control.

3: Good movies enjoy positive cycles. According to the regression results, a longer runtime and a larger number of votes indicate that a movie is probably good. On reflection, these factors influence each other in a cyclic way. If a movie is considered good, cinemas would usually extend its runtime in order to generate more revenue from it. When the runtime is extended, more people have the chance to watch the movie, and the number of votes increases, both because the audience grows and because more people love the movie and want to give it a high rating to recommend it. In this way, a positive cycle forms, and good movies are differentiated from bad ones.

Potential Confounds: The biggest confound is that our dependent variable and independent variables influence each other, as mentioned above. This can make the regression model inaccurate, since regression assumes that X affects Y and not the reverse. However, this problem can hardly be resolved even in an ideal experimental setting. I cannot control the cyclic effect of good movies in any way. So this requires us to consider our inputs more carefully in further regression studies; a new model or new independent variables might help with this issue.

Jensen Xu
