October 1, 2019

Overview

The movies dataset consisted of 651 randomly sampled movies produced and released between 1970 and 2014. The dataset was generated from the Rotten Tomatoes and IMDb databases.

Data Manipulation

For this data analysis, five new variables must be created from four of the 32 existing variables. In all, sixteen explanatory variables are required for the modeling. The five new variables are:

  1. feature_film created from the variable title_type
  2. drama created from the variable genre
  3. mpaa_rating_R created from the variable mpaa_rating
  4. oscar_season created from the variable thtr_rel_month
  5. summer_season created from the variable thtr_rel_month
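The derivation of the five indicators can be sketched as below. This is a minimal sketch, not the original analysis code: the helper names are hypothetical, and the matching values ("Feature Film", "Drama", "R", October to December for the Oscar season, May to August for the summer season) are assumptions about how the levels are defined, since the text above names only the source variables.

```python
# Hypothetical sketch: derive the five yes/no indicators for one movie record.
# The matching values below are assumptions, not taken from the text above.
OSCAR_MONTHS = {10, 11, 12}   # assumed: October, November, December
SUMMER_MONTHS = {5, 6, 7, 8}  # assumed: May, June, July, August


def yes_no(condition):
    """Encode a boolean as the 'yes'/'no' levels used in the report."""
    return "yes" if condition else "no"


def add_indicators(movie):
    """Return a copy of the record with the five derived variables added."""
    derived = dict(movie)
    derived["feature_film"] = yes_no(movie["title_type"] == "Feature Film")
    derived["drama"] = yes_no(movie["genre"] == "Drama")
    derived["mpaa_rating_R"] = yes_no(movie["mpaa_rating"] == "R")
    derived["oscar_season"] = yes_no(movie["thtr_rel_month"] in OSCAR_MONTHS)
    derived["summer_season"] = yes_no(movie["thtr_rel_month"] in SUMMER_MONTHS)
    return derived


# Made-up record for illustration only (not a row from the dataset).
example = {
    "title_type": "Feature Film",
    "genre": "Drama",
    "mpaa_rating": "PG-13",
    "thtr_rel_month": 11,
}
result = add_indicators(example)
```

Note that oscar_season and summer_season both come from thtr_rel_month, which is why five new variables need only four source variables.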

Exploratory data analysis

Out of the 32 variables in the movies dataset, only 17 are required for this analysis, including the response variable. All 17 are listed below, starting with the response variable.

  1. audience_score
  2. feature_film
  3. drama
  4. mpaa_rating_R
  5. oscar_season
  6. summer_season
  7. runtime
  8. thtr_rel_year
  9. imdb_rating
  10. imdb_num_votes
  11. critics_score
  12. best_pic_nom
  13. best_pic_win
  14. best_actor_win
  15. best_actress_win
  16. best_dir_win
  17. top200_box
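Restricting the data to the response and the sixteen explanatory variables can be sketched as follows. This is an illustrative sketch only: records are assumed to be plain dictionaries, and `select_model_columns` is a hypothetical helper, not part of the original analysis.

```python
# The response variable followed by the sixteen explanatory variables.
MODEL_COLUMNS = [
    "audience_score",
    "feature_film", "drama", "mpaa_rating_R", "oscar_season", "summer_season",
    "runtime", "thtr_rel_year", "imdb_rating", "imdb_num_votes",
    "critics_score", "best_pic_nom", "best_pic_win", "best_actor_win",
    "best_actress_win", "best_dir_win", "top200_box",
]


def select_model_columns(rows):
    """Keep only the modeling columns from each record (hypothetical helper)."""
    return [{name: row[name] for name in MODEL_COLUMNS} for row in rows]


# Placeholder record: every modeling column plus one column that is dropped.
row = {name: None for name in MODEL_COLUMNS}
row["title"] = "placeholder"
selected = select_model_columns([row])
```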

Graphical visualization

Discussion

A close examination of both forms of the boxplots showed that the plots of audience_score versus oscar_season, summer_season, drama, mpaa_rating_R, best_actor_win, best_actress_win, and best_dir_win reveal no distinct difference in the response variable between the 'yes' and 'no' levels of each explanatory variable.

The plot of audience_score versus feature_film is more informative: the lowest value in the 'no' level, which is flagged as an outlier, coincides with the lowest value in the 'yes' level, which is not an outlier.
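Why the same low score can be an outlier in one level but not in the other can be illustrated with the usual boxplot rule, which flags points lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below assumes that rule and uses made-up scores, not values from the movies dataset; the function names are hypothetical.

```python
from statistics import quantiles


def box_stats(scores):
    """Quartiles plus the points outside the 1.5 * IQR fences."""
    q1, median, q3 = quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [s for s in scores if s < lower or s > upper]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


def box_stats_by_level(rows, factor, response="audience_score"):
    """Group the response by the levels of one factor and summarize each."""
    groups = {}
    for row in rows:
        groups.setdefault(row[factor], []).append(row[response])
    return {level: box_stats(scores) for level, scores in groups.items()}


# Made-up scores: both levels share the same minimum (19), but the 'no'
# level is tightly clustered at high scores while 'yes' is widely spread.
toy_no = [19, 80, 82, 85, 86, 88, 90]
toy_yes = [19, 35, 45, 55, 65, 75, 85]
rows = [{"feature_film": "no", "audience_score": s} for s in toy_no]
rows += [{"feature_film": "yes", "audience_score": s} for s in toy_yes]

stats = box_stats_by_level(rows, "feature_film")
```

In this toy example the score 19 falls below the lower fence of the tightly clustered 'no' group but sits inside the fences of the widely spread 'yes' group, so only the 'no' boxplot would draw it as an outlier.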

The plots of audience_score versus best_pic_nom, best_pic_win, and top200_box each showed high 'yes' levels