Raw Data: 7 Excel files covering 45,000 movies, including cast, crew, keywords and ratings.

Take the crew field of one movie (movie ID = 1) as an example: a single cell holds 107 crew members. Each {…} is one crew member, and one movie has 100+ of them. I show only 3 crew members here:

{'credit_id': '52fe4284c3a36847f8024f49', 'department': 'Directing', 'gender': 2, 'id': 7879, 'job': 'Director', 'name': 'John Lasseter', 'profile_path': '/7EdqiNbr4FRjIhKHyPPdFfEEEFG.jpg'}, {'credit_id': '52fe4284c3a36847f8024f4f', 'department': 'Writing', 'gender': 2, 'id': 12891, 'job': 'Screenplay', 'name': 'Joss Whedon', 'profile_path': '/dTiVsuaTVTeGmvkhcyJvKp2A5kr.jpg'}, {'credit_id': '52fe4284c3a36847f8024f55', 'department': 'Writing', 'gender': 2, 'id': 7, 'job': 'Screenplay', 'name': 'Andrew Stanton', 'profile_path': '/pvQWsu0qc8JFQhMVJkTHuexUAa1.jpg'}

Clean Data Sample Code:

# split cast
library(stringr)  # provides str_split()
# set working directory
setwd("C:/Users/14695/Desktop/mast6251/HW2")
# clean credits
credits <- read.csv("Data/credits.csv")
cast <- credits$cast
# a list to store result
base2 <- vector(mode = "list", length = nrow(credits))
# a loop to extract id and name
for (i in 1:nrow(credits)){
  # for each string
  string <- credits$cast[i]
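  # split on ", " so that each piece is one 'key': value pair
  # (names may contain digits, so the fields are separated first)
  sub <- str_split(string = string, pattern = ", ")[[1]]
  # each cast entry has 8 fields; the position modulo 8 identifies the field:
  # 1 is cast_id; 2 is character; 3 is credit_id; 4 is gender; 5 is id;
  # 6 is name; 7 is order; 0 is profile_path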
  cast_id <- sub[1:length(sub) %% 8 == 1]
  character <- sub[1:length(sub) %% 8 == 2]
  credit_id <- sub[1:length(sub) %% 8 == 3] ## useless! remove!
  gender <- sub[1:length(sub) %% 8 == 4] ## useless! remove!
  nameid <- sub[1:length(sub) %% 8 == 5]
  name <- sub[1:length(sub) %% 8 == 6]
  order <- sub[1:length(sub) %% 8 == 7] 
  #########################################################
  # 1. extract the digits from each numeric field
  ## cast_id
  cast_id <- as.numeric(unlist(regmatches(cast_id, gregexpr("[[:digit:]]+", cast_id))))
  ## name_id
  nameid <- as.numeric(unlist(regmatches(nameid, gregexpr("[[:digit:]]+", nameid))))
  ## order
  order <- as.numeric(unlist(regmatches(order, gregexpr("[[:digit:]]+", order))))
  # 2. extract character and name
  # remove quotes, colons, square brackets, braces and commas
  pattern <- c("'", '"', ":", "\\[|\\]", "\\{|\\}", ",")
  ## character
  x <- character
  for (j in pattern){
    # gsub() is vectorized and removes every occurrence of the pattern
    x <- gsub(pattern = j, replacement = "", x = x)
  }
  # after cleaning, each element looks like "character <value>"
  substr <- str_split(string = x, pattern = "name ")
  # keep the first piece and drop the leading "character " label
  character <- sapply(substr, function(x) x[1])
  character <- substring(character, (nchar("character")+2))
  ## name
  x <- name
  for (j in pattern){
    # gsub() is vectorized and removes every occurrence of the pattern
    x <- gsub(pattern = j, replacement = "", x = x)
  }
  # after cleaning, each element looks like "name <value>";
  # splitting on "name " leaves an empty first piece, so keep the second
  substr <- str_split(string = x, pattern = "name ")
  name <- sapply(substr, function(x) x[2])

  ######### combine
  id <- rep(credits$id[i], length(name))
  output2 <- as.matrix(cbind(as.character(id),character,name,order))
  # output
  base2[[i]] <- output2
}

#######
# drop results that have fewer than 3 columns
base3 <- lapply(base2, function(x) if (is.null(x) || ncol(x) < 3) NULL else x)
# keep only the matrices with exactly 4 columns so that they can be row-bound
keep <- which(sapply(base3, function(x) !is.null(x) && ncol(x) == 4))
base4 <- base3[keep]
idname2 <- do.call(what = "rbind", args = base4) ## matrix
idname3 <- as.data.frame(idname2)
# drop rows with any missing value
idname4 <- as.matrix(idname3[!rowSums(is.na(idname3))>0,])
# a matrix needs colnames(), not names(), to label its columns
colnames(idname4) <- c("id","cast character","cast name", "cast order")
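
The crew strings shown earlier can be handled in a similar way. As a hedged alternative to positional splitting, the sketch below pulls quoted fields out by key with a base R regular expression, using two of the sample crew entries from above. The helper `extract_field` and the enclosing `[` `]` are assumptions for illustration and not part of the original script; values that contain an apostrophe would need extra handling.

```r
# Two of the sample crew entries from above; the enclosing [ ] is assumed.
crew_string <- paste0(
  "[{'credit_id': '52fe4284c3a36847f8024f49', 'department': 'Directing', ",
  "'gender': 2, 'id': 7879, 'job': 'Director', 'name': 'John Lasseter', ",
  "'profile_path': '/7EdqiNbr4FRjIhKHyPPdFfEEEFG.jpg'}, ",
  "{'credit_id': '52fe4284c3a36847f8024f4f', 'department': 'Writing', ",
  "'gender': 2, 'id': 12891, 'job': 'Screenplay', 'name': 'Joss Whedon', ",
  "'profile_path': '/dTiVsuaTVTeGmvkhcyJvKp2A5kr.jpg'}]"
)

# Hypothetical helper: return the value of one quoted field for every entry.
extract_field <- function(x, field) {
  pattern <- paste0("'", field, "': '[^']*'")
  hits <- regmatches(x, gregexpr(pattern, x))[[1]]
  # strip the leading "'field': '" and the trailing quote
  sub("'$", "", sub(paste0("^'", field, "': '"), "", hits))
}

crew_names <- extract_field(crew_string, "name")
crew_jobs  <- extract_field(crew_string, "job")

# keep only the people credited as Director
directors <- crew_names[crew_jobs == "Director"]
print(directors)
```

Matching by key avoids relying on every entry having the same number of comma-separated pieces, which is the main assumption behind the modulo-based split.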

Introduction and Clean Data

I analyzed a dataset of 45,000 movies released up to 2017, together with related information such as vote count, cast, keywords, budget and production companies. The goal is to clean the dataset, remove uninformative variables and build a logistic regression model to identify the factors that affect the final quality of a movie, in other words whether it is good or bad. For data cleaning, I split the string data in the credits file to get the top 30 cast members and directors, which helped identify the effect that cast or directors might have on movie ratings. IMDB ratings were obtained from the official website and matched with the dataset by IMDB ID. The final rating was calculated accordingly.
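
Selecting the most frequent cast members can be sketched as follows. This is a minimal illustration on made-up data, assuming a long table with one row per movie and cast member; the names `cast_long` and `top_n` are hypothetical, and `top_n` is set to 2 here only so that the toy data shows the effect (the report uses 30).

```r
# Made-up long table: one row per (movie id, cast member).
cast_long <- data.frame(
  id   = c(1, 1, 2, 2, 3, 3, 4),
  name = c("A", "B", "A", "C", "A", "B", "D"),
  stringsAsFactors = FALSE
)

top_n <- 2  # the report uses 30

# count appearances per person and sort from most to least frequent
counts <- sort(table(cast_long$name), decreasing = TRUE)
top_names <- names(counts)[seq_len(min(top_n, length(counts)))]

# keep only the rows that belong to the most frequent people
cast_top <- cast_long[cast_long$name %in% top_names, ]
print(cast_top)
```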

The Data

First, the median rating is set as the threshold for quality evaluation. Then, each recorded rating was left joined with the movie ID and other related information. Third, I focused on variables that are known before a movie is released, such as cast, crew, genre, budget and runtime, rather than revenue and votes, which only become available after release.
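
The labelling and join steps can be sketched as below. The data are made up, the column names (`imdb_id`, `rating`, `runtime`) are assumptions, and treating a rating at or above the median as "good" is also an assumption, since the text only states that the median is the threshold.

```r
# Made-up ratings and movie metadata keyed by a hypothetical imdb_id column.
ratings <- data.frame(
  imdb_id = c("tt1", "tt2", "tt3", "tt9"),
  rating  = c(8.1, 5.2, 6.9, 4.0),
  stringsAsFactors = FALSE
)
movies <- data.frame(
  imdb_id = c("tt1", "tt2", "tt3", "tt4"),
  runtime = c(81, 104, 101, 127),
  stringsAsFactors = FALSE
)

# the median rating is the threshold; 1 = good, 0 = bad (">=" is an assumption)
threshold <- median(ratings$rating)
ratings$final_rate <- as.integer(ratings$rating >= threshold)

# left join: all.x = TRUE keeps every rating, even without matching metadata
joined <- merge(ratings, movies, by = "imdb_id", all.x = TRUE)
print(joined)
```

Ratings without a matching movie keep their row and get `NA` in the metadata columns, which makes unmatched IDs easy to spot before modelling.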

The two graphs show the mean rating of good and bad movies over the past 100 years. In the first graph, the average rating of good movies is around 7.1, with some extremely highly rated movies during 1925-1965; after that period the ratings become stable. In the second graph, there is a slightly increasing trend before 1960, after which the average rating of bad movies stabilizes around 5.5. Both graphs show a tipping point in 1965. Before 1965, the average rating fluctuates strongly. After 1965, the gap between the ratings of good and bad movies becomes stable.

The following table contains the correlations between final_rate (y) and the five numerical variables. Because a correlation table only applies to numeric variables and not to the 0/1 variables, it leaves those variables out (see Logistic Regression on page 4). Note that numVotes is extracted from the external IMDB website, and movies with zero budget and revenue were excluded. Interestingly, although there is no obvious correlation between y (final rating) and the variables in the chart below, revenue is highly correlated with budget and numVotes. This makes sense, since a high budget usually allows heavy marketing, which in turn affects final revenue.

The following plots show revenue by genre. Most of the good and blockbuster movies fall under the animation, mystery, crime and music genres. Most bad movies are westerns, adventures and documentaries; drama and war movies received bad ratings with high revenue. Comparing both graphs, I find that the highest revenue achieved by a good movie is almost 250 million dollars more than that of a bad movie, which indicates that although bad movies can perform well in revenue, people are still willing to spend more on good movies.

I created a wordcloud of tags to see whether good and bad movies differ in their keywords. Unfortunately, the top 9 most frequent keywords are distributed quite similarly for good and bad movies.
Hence a combined wordcloud was drawn. Independent film (1424) ranks first by count, followed by woman director (964), murder (740) and sex (453). Although I could not distinguish good from bad movies by keyword tags, the wordcloud provides a straightforward view of keyword frequency and can help film producers identify popular themes when investing in a new movie.
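
The counts behind such a wordcloud are plain keyword frequencies. A minimal sketch on made-up data, assuming the keywords have already been parsed into one character vector per movie (the object `keywords` is hypothetical):

```r
# Made-up keyword lists, one character vector per movie.
keywords <- list(
  c("independent film", "murder"),
  c("woman director", "independent film"),
  c("murder", "independent film", "sex")
)

# flatten all movies into one vector and count each keyword
freq <- sort(table(unlist(keywords)), decreasing = TRUE)

# the most frequent keywords, i.e. the largest words in a wordcloud
print(head(freq, 3))
```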
I selected the top 30 directors and cast members by the number of movies they took part in, to see whether some directors or cast members tend to produce good movies. In the director plot, Alfred Hitchcock has the highest number of good movies and Michael Curtiz has the highest number of bad movies. By the same token, in the cast chart John Wayne starred in the most good movies and Gene Hackman acted in the most bad movies. These two graphs could help film investors judge the likely quality of a movie more efficiently by looking up its cast and director. This matches the intuition that a good director or actor may well be a guarantee of a good movie.

The Model

To understand the relationship between the numeric variables and the dependent variable, I picked different numeric variables as independent variables to test which would generate the best-fitting model. When the 5 numerical variables (runtime, Nummovies, budget, numVotes and revenue) were added to the model, the AIC was around 7388 and changed only slightly. All numerical variables have insignificant p-values except runtime and the number of movies released in a given year. Runtime's coefficient is -0.003 in the multivariable model and 0.004 when runtime is the only independent variable. The number of movies released in a given year has a coefficient of -0.0007. These suggest that, in this model, a longer runtime and a larger number of movies released increase the probability of a bad movie.
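
Comparing specifications by AIC can be sketched with `glm()` as below. The data are simulated purely for illustration, so the coefficients and AIC values will not match the figures reported above; the column names and the simulation parameters are assumptions. For reading the coefficients, a logit coefficient of -0.003 corresponds to an odds ratio of exp(-0.003), roughly 0.997 per extra minute of runtime.

```r
set.seed(1)

# Simulated data (assumption): runtime in minutes, budget in millions.
n <- 200
movies <- data.frame(
  runtime = rnorm(n, mean = 100, sd = 20),
  budget  = rexp(n, rate = 1 / 50)
)
# simulate a 0/1 outcome whose log-odds depend on runtime only
movies$final_rate <- rbinom(n, 1, plogis(-1 + 0.01 * movies$runtime))

# logistic regression with one and with two numeric predictors
fit_runtime <- glm(final_rate ~ runtime, data = movies, family = binomial)
fit_both    <- glm(final_rate ~ runtime + budget, data = movies, family = binomial)

# lower AIC indicates a better trade-off between fit and complexity
aic_table <- AIC(fit_runtime, fit_both)
print(aic_table)

# coefficients with standard errors and p-values
print(coef(summary(fit_both)))

# odds ratio per extra minute of runtime in the single-predictor model
print(exp(coef(fit_runtime)["runtime"]))
```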

Then I created dummy variables for each categorical variable listed below to show how they affect whether a movie is good or not.
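
One way to build such 0/1 columns in R is `model.matrix()`. The sketch below uses a made-up `genre` column as a stand-in for the categorical variables; it shows the general technique and is not necessarily how the original dummies were built.

```r
# Made-up data with one categorical column.
movies <- data.frame(
  genre   = factor(c("Drama", "Comedy", "Drama", "War")),
  runtime = c(120, 95, 110, 140)
)

# one 0/1 column per level; "- 1" drops the intercept so no level is omitted
dummies <- model.matrix(~ genre - 1, data = movies)
print(dummies)

# attach the dummy columns to the original data
movies_dummy <- cbind(movies, dummies)
```

When a factor is passed directly to `glm()`, R creates the dummy columns itself and drops one level as the baseline, so explicit dummies are mainly useful when specific levels need to be selected by hand.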

The Insights

From the data analysis and regression model above, I can now tell whether a movie is likely to be good or not, given some information. The analysis of the movie dataset can be summarized as follows:

1: Movie quality has been stable and distinguishable in recent years. Unlike with old movies, the ratings of good and bad movies now tend to polarize and cluster. A possible interpretation of this phenomenon is the maturity of the movie industry and the successful cultivation of movie-watching habits among audiences. Experienced audiences can appreciate and rate movies more sensibly. Moreover, movie production is now industrialized. Content quality depends on the production company's ultimate goal: whether to target a niche market with a good story or to cater to the mainstream market with a blockbuster.

2: A good start is important for directors. As I mentioned before, a director who has a highly rated movie at the beginning of their career is more likely to direct good movies afterwards than one who does not. This could be explained in two ways. First, a talented director would produce good movies regardless of whether the movie has a large budget. Second, if a new director's first movie has a low rating, it is hard to convince people to invest in later ideas, because investors would consider the expected return hard to achieve. Considering both situations, my suggestion is to try really hard on your first movie, as effort, unlike talent, is something a new director can control.

3: Good movies enjoy positive cycles. According to the regression results, a longer runtime and a larger number of votes indicate that a movie is probably good. On reflection, these factors influence each other in a cyclic way. If a movie is considered good, cinemas usually extend its runtime to generate more revenue from it. When the runtime is extended, more people have the chance to watch the movie, and the number of votes increases, both because the audience grows and because more people love the movie and want to give it a high rating to recommend it. In this way a positive cycle forms, and good movies are differentiated from bad ones.

Potential Confounds: The biggest confound is that the dependent variable and the independent variables influence each other, as mentioned above. This can make the regression model inaccurate, since regression assumes that X affects Y and not the reverse. However, this problem could hardly be resolved even in an ideal experimental setting, because I could not control the cyclic effect of good movies in any way. This requires considering the inputs more carefully in further regression studies; a new model or new independent variables might help with this issue.