The dataset we are using is available on Kaggle. The data was collected from imdb.com, where users can rate films and series according to how good they thought they were. These ratings are averaged into a single continuous mean IMDb rating, ranging from 1.0 to 10.0, for each film or series. This rating is the variable Rate in our dataset. IMDb ratings are useful for seeing what people think about a show or movie before watching it. Our dataset includes about 6000 of the most popular TV shows and movies.
The variables in the data set, other than Rate, that could affect IMDb rating include Date (Air date), Votes (Total votes), Genre, Duration, Type (series or film/movie), Certificate (Rated R, PG-13, etc.), Episodes, Nudity (Amount of nudity), Violence (Amount of violence), Profanity (Amount of profanity), Alcohol (Amount of alcohol), and Frightening (How frightening). The final variable in the data set is Name, which is the name of the film / series, but is unique for each row. All of these other variables are details about specific productions that may affect IMDb ratings, some of which we will be investigating.
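Our research questions examine how different characteristics of a production relate to its IMDb rating. Specifically, we want to know which variables best predict IMDb rating, whether the level of nudity in a production is related to its IMDb rating, and whether dramas receive higher IMDb ratings than other genres.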
These topics are interesting because they let us figure out which factors might be influencing IMDb rating and, if they do, by how much. The results can help with the creation of series or movies, since they show which elements a production should keep in mind in order to potentially increase its IMDb rating. For example, if IMDb rating is found to increase when there is no nudity, the producer of a series or movie could decide to remove any nudity they might have included. Producers would want a higher rating in order to convince others to watch their production, or to make their production look better overall.
Here is a quick look at our codebook:
Variable Name | Description |
---|---|
Name | Name of the production |
Date | Date of creation / airing for production |
Rate | IMDb rating for specific production |
Votes | Number of votes received for the IMDb rating (each vote is on a scale of 1-10) |
Genre | Genre (Drama, Comedy, etc.) |
Duration | Duration (min) |
Type | Whether it is a film or series |
Certificate | Rating of film or series (R, PG-13, etc.) |
Episodes | Number of episodes |
Nudity | Amount of nudity (No Rate, None, Mild, Moderate, Severe) |
Violence | Amount of violence (No Rate, None, Mild, Moderate, Severe) |
Profanity | Amount of profanity (No Rate, None, Mild, Moderate, Severe) |
Alcohol | Amount of alcohol (No Rate, None, Mild, Moderate, Severe) |
Frightening | How frightening the film or series was (No Rate, None, Mild, Moderate, Severe) |
In this section we will analyze our data to answer our research questions.
This question could help us find a good model for predicting the IMDb rating of films and series. Using this model, producers could estimate what IMDb rating they could receive based on various traits of their production. Many factors can potentially influence IMDb rating, as people may prefer certain aspects of a production, causing them to rate it highly. In order to create a model that predicts IMDb rating well, we want to find variables that may have a relationship with the response variable, Rate. We can now test some variables to see whether they do.
We will first filter the dataset to remove observations that have “None” as the category for any variable we may use. The “No Rate” level, which the codebook uses for productions that did not get a specific rating for that variable, remains in the data. We will also only use productions with a duration of less than 250 minutes, since the majority of the data falls within this range.
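The filtering step can be sketched in Python as follows. This is a minimal illustration, not the code used for the analysis: the rows are made up, and the helper name `filter_rows` and its parameters are hypothetical. Only the field names and the 250-minute cutoff come from the text above.

```python
# Hypothetical rows; field names follow the codebook, the values are invented.
rows = [
    {"Name": "A", "Rate": 7.1, "Duration": 120, "Violence": "Mild"},
    {"Name": "B", "Rate": 6.4, "Duration": 300, "Violence": "Severe"},
    {"Name": "C", "Rate": 8.0, "Duration": 45, "Violence": "None"},
    {"Name": "D", "Rate": 5.9, "Duration": 95, "Violence": "No Rate"},
]


def filter_rows(rows, level_columns, dropped_level="None", max_duration=250):
    """Keep rows shorter than max_duration whose listed columns avoid dropped_level."""
    kept = []
    for row in rows:
        if row["Duration"] >= max_duration:
            continue  # too long: outside the range we analyze
        if any(row[column] == dropped_level for column in level_columns):
            continue  # has the dropped category in a variable we may use
        kept.append(row)
    return kept


filtered = filter_rows(rows, ["Violence"])
print([row["Name"] for row in filtered])
```

Passing the list of columns explicitly keeps the filter tied to the variables actually considered for the model.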
We will next calculate some statistics and create visualizations to figure out what potential variables we may include in our model. We want to find variables that have a relationship with our response variable, Rate, to use in our model since they could help predict IMDb rating more accurately.
The first variable we will look at will be Rate, our response variable. We will calculate the mean and median Rate for productions in our data set to potentially compare to our other variables.
## # A tibble: 1 × 2
## mean_Rate median_Rate
## <dbl> <dbl>
## 1 6.91 7
We notice that both the mean and median IMDb rating for productions in our data set are around 7. We will next calculate the same statistics for one of our potential predictor variables, Type, separated by its two categories, series or film.
## # A tibble: 2 × 3
## Type mean_Type median_Type
## <chr> <dbl> <dbl>
## 1 Film 6.67 6.8
## 2 Series 7.62 7.7
These statistics show that films and series have different ratings on average: the mean and median ratings for films are about 0.9 to 1 point lower than those for series. Both groups also differ from the overall mean and median of Rate calculated earlier. This suggests a relationship between Type, the type of production, and Rate, the IMDb rating, with series tending to be rated higher than films. We can use Type as an explanatory variable.
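A grouped summary like the one above can be sketched in Python. This is an illustration on invented ratings, not the code behind the reported numbers. The helper `summarize_by` is a hypothetical name.

```python
from collections import defaultdict
from statistics import mean, median


def summarize_by(rows, group_key, value_key):
    """Return the mean and median of value_key for each level of group_key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {
        level: {"mean": mean(values), "median": median(values)}
        for level, values in sorted(groups.items())
    }


# Invented ratings, used only to show the shape of the result.
rows = [
    {"Type": "Film", "Rate": 6.0},
    {"Type": "Film", "Rate": 7.0},
    {"Type": "Film", "Rate": 6.5},
    {"Type": "Series", "Rate": 8.0},
    {"Type": "Series", "Rate": 7.0},
]
summary = summarize_by(rows, "Type", "Rate")
for level, stats in summary.items():
    print(level, stats)
```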
We will also test another potential predictor variable, Duration, by making a scatter plot comparing it to Rate.
In the scatter plot, there is no strong relationship between these two variables. Looking closely, as duration increases, IMDb rating stays about the same or slightly decreases for the most part. Since we observe a weak relationship between Duration and Rate, we could still use Duration as an explanatory variable, as it may account for some of the variability in Rate.
Another potential variable we will test is Violence, representing the amount of violence in a production. We can visualize if Violence is in any way related to Rate by creating a box plot, then calculating the mean rating for each level of Violence.
## # A tibble: 4 × 2
## Violence mean_Violence
## <chr> <dbl>
## 1 Mild 7.01
## 2 Moderate 6.92
## 3 No Rate 6.58
## 4 Severe 6.89
In the box plot, Violence appears to have some relationship with the response, as the medians differ and the spread varies across the levels of Violence. From the means, we can see that “No Rate” has a noticeably lower mean rating (6.58) than the other levels and is also below the overall mean rating (6.91) calculated earlier, which further suggests that Violence could be related to Rate. We can use this variable as a potential explanatory variable, as it could help our model predict Rate more accurately.
The last variable we will try is one we create ourselves, Air_Date. It represents the decade in which a production aired, except that all years before 1960 are grouped into a single category, <1960s, and the years from 2020 onward are grouped into the 2020s. We will create a box plot to see how it relates to Rate.
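As a rough sketch of how such a variable could be derived, the Python function below maps a year to its Air_Date level. It assumes the Date column can be read as an integer year, which the text does not state. The function name `air_date_bin` is hypothetical.

```python
def air_date_bin(year):
    """Map an air year to a decade label, pooling years before 1960 and from 2020 on."""
    if year < 1960:
        return "<1960s"
    if year >= 2020:
        return "2020s"
    decade_start = (year // 10) * 10  # e.g. 1994 -> 1990
    return f"{decade_start}s"


for year in (1948, 1960, 1994, 2019, 2021):
    print(year, air_date_bin(year))
```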
In the box plot, the median IMDb rating differs across most levels of Air_Date. The median tends to decrease slightly with each time period from before the 1960s to the 2020s, with the exception of the 1990s, which suggests Air_Date could be related to Rate. Because of this, we could use Air_Date as another predictor variable, since it appears to be associated with Rate and may make our predictions more accurate.
Now we have identified 4 potential explanatory variables, Type, Duration, Violence, and Air_Date, that may help account for some of the variation within the data, making our model more accurate and a better fit. We will fit 3 models with these variables now to figure out which creates the best fit.
We will start fitting linear models with a training split. We will use an 80/20 split, training having 80% of the rows and testing with 20%, with a seed of 1000.
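A seeded 80/20 split can be sketched in Python as below. This only illustrates the idea: a seed of 1000 in Python's `random` module will not reproduce the split produced by the original analysis, and the helper name `split_rows` is hypothetical.

```python
import random


def split_rows(rows, train_fraction=0.8, seed=1000):
    """Shuffle a copy of rows with a fixed seed and cut it into training and testing sets."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the data set: 100 row identifiers.
training, testing = split_rows(range(100))
print(len(training), len(testing))
```

Fixing the seed makes the split repeatable, so every model is trained and evaluated on the same rows.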
The first model uses Type and Duration, the two variables we explored first, as predictors for Rate. We will calculate the adjusted R-squared for every model to see how much of the variability in Rate is explained by the model, adjusted for the number of predictors. A higher adjusted R-squared means the model explains more of the variability. Below is the adjusted R-squared for the first model.
## # A tibble: 1 × 1
## adj.r.squared
## <dbl>
## 1 0.235
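For reference, adjusted R-squared is computed from the ordinary R-squared, the number of observations n, and the number of predictor terms p as 1 - (1 - R²)(n - 1)/(n - p - 1). The Python sketch below uses invented inputs, not the values behind the output above.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Penalize R-squared for the number of predictor terms in the model."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Invented example: the same R-squared with 2 versus 3 predictor terms.
two_terms = adjusted_r_squared(0.30, n_obs=101, n_predictors=2)
three_terms = adjusted_r_squared(0.30, n_obs=101, n_predictors=3)
print(round(two_terms, 4), round(three_terms, 4))
```

With R-squared held fixed, adding a term lowers the adjusted value, which is why it is a fairer way to compare models of different sizes.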
The adjusted R-squared for this model was 0.235. We will see if adding a new variable, Air_Date, increases the adjusted R-squared by accounting for more variability in the data. We will now calculate the adjusted R-squared for a model with Type, Duration, and Air_Date as predictors for Rate.
## # A tibble: 1 × 1
## adj.r.squared
## <dbl>
## 1 0.275
With the addition of Air_Date, the adjusted R-squared is 0.275, 0.040 greater than for the first model. This model accounts for more of the variability within the data set, but adding one more variable could make it even better. We will now add Violence as a predictor to our model and calculate the adjusted R-squared.
## # A tibble: 1 × 1
## adj.r.squared
## <dbl>
## 1 0.288
This final model has an adjusted R-squared of 0.288, higher than both of the other models. The 2nd and 3rd models account for more of the variability within our data than the first, so they could potentially be more accurate at predicting Rate. To decide which of these two to use, we can calculate their RMSE (root mean squared error), which measures how accurately they predict Rate. We want an RMSE closer to 0, as a smaller RMSE means the model has a smaller error and is more accurate when predicting the response variable, in our case Rate.
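RMSE is the square root of the average squared difference between the actual and predicted values. The Python sketch below illustrates the calculation on invented ratings. It is not the code used to produce the results in this report.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences of numbers."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Invented ratings and predictions: errors of 0.5, -0.5 and 1.0.
example_rmse = rmse([7.0, 6.0, 8.0], [6.5, 6.5, 7.0])
print(round(example_rmse, 4))
```

Because the errors are squared before averaging, one large miss raises the RMSE more than several small ones.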
The first model we will use is model 2, with Type, Duration, and Air_Date as predictors for Rate.
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 0.897
This model has an RMSE of 0.897. We will now calculate the RMSE for the third model, which includes Violence as a predictor along with the other variables.
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 0.878
This final model with the extra variable, Violence, has an RMSE of 0.878. It is not much different from our other model: the third model has an RMSE only 0.019 smaller and an adjusted R-squared only 0.013 greater than the second model. According to Occam's razor, when comparing models that predict equally well, you should choose the simplest one. Since these models predict about equally well, we will choose model 2, which predicts Rate from Type, Duration, and Air_Date, as it is simpler and has fewer variables.
Before we interpret the model, we will observe a residual plot to see if there are any obvious patterns in our model.
It looks like the residuals are random and there are no obvious patterns. This model seems like a reasonable fit, so we can now interpret it.
Here are the coefficients for the model we chose to interpret:
## # A tibble: 10 × 2
## term estimate
## <chr> <dbl>
## 1 (Intercept) 5.99
## 2 TypeSeries 1.91
## 3 Duration 0.0136
## 4 Air_Date1960s -0.471
## 5 Air_Date1970s -0.332
## 6 Air_Date1980s -0.728
## 7 Air_Date1990s -0.695
## 8 Air_Date2000s -0.798
## 9 Air_Date2010s -0.945
## 10 Air_Date2020s -1.18
Interpretations:
(Intercept): The intercept for this model is 5.99, meaning we expect Rate, IMDb rating, to be 5.99 for a production at the baseline levels (Type being Film and Air_Date being <1960s, meaning the production aired before 1960) with a Duration of 0 minutes.
TypeSeries: When the production is a series, we expect Rate, IMDb rating, to be 1.91 higher than for a film, holding other variables constant. (Baseline of TypeFilm, meaning the production is a film.)
Duration: When Duration increases by 1, meaning the production is a minute longer, we expect Rate, IMDb rating, to increase by 0.0136, holding other variables constant.
Air_Date1960s: When a production aired in the 1960s, we expect Rate, IMDb rating, to be 0.471 lower than for a production aired prior to 1960, holding other variables constant. (Baseline of Air_Date<1960s.)
Air_Date1970s: When a production aired in the 1970s, we expect Rate, IMDb rating, to be 0.332 lower than for a production aired prior to 1960, holding other variables constant. (Baseline of Air_Date<1960s.)
Air_Date1980s: When a production aired in the 1980s, we expect Rate, IMDb rating, to be 0.728 lower than for a production aired prior to 1960, holding other variables constant. (Baseline of Air_Date<1960s.)
Air_Date1990s: When a production aired in the 1990s, we expect Rate, IMDb rating, to be 0.695 lower than for a production aired prior to 1960, holding other variables constant. (Baseline of Air_Date<1960s.)
Air_Date2000s: When a production aired in the 2000s, we expect Rate, IMDb rating, to be 0.798 lower than for a production aired prior to 1960, holding other variables constant. (Baseline of Air_Date<1960s.)
Air_Date2010s: When a production aired in the 2010s, we expect Rate, IMDb rating, to be 0.945 lower than for a production aired prior to 1960, holding other variables constant. (Baseline of Air_Date<1960s.)
Air_Date2020s: When a production aired in the 2020s, we expect Rate, IMDb rating, to be 1.18 lower than for a production aired prior to 1960, holding other variables constant. (Baseline of Air_Date<1960s.)
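To see how these coefficients combine into a prediction, the Python sketch below plugs the rounded estimates from the table into the model equation. The function name `predict_rate` and the two example productions are hypothetical. Predictions from rounded coefficients will differ slightly from those of the fitted model.

```python
# Rounded coefficient estimates copied from the table above.
COEFFICIENTS = {
    "(Intercept)": 5.99,
    "TypeSeries": 1.91,
    "Duration": 0.0136,
    "Air_Date1960s": -0.471,
    "Air_Date1970s": -0.332,
    "Air_Date1980s": -0.728,
    "Air_Date1990s": -0.695,
    "Air_Date2000s": -0.798,
    "Air_Date2010s": -0.945,
    "Air_Date2020s": -1.18,
}


def predict_rate(production_type, duration, air_date):
    """Predicted Rate; Film and <1960s are the baseline levels and add nothing."""
    rate = COEFFICIENTS["(Intercept)"] + COEFFICIENTS["Duration"] * duration
    if production_type == "Series":
        rate += COEFFICIENTS["TypeSeries"]
    if air_date != "<1960s":
        rate += COEFFICIENTS[f"Air_Date{air_date}"]
    return rate


# Hypothetical productions: a 120-minute film and a 45-minute series, both from the 2010s.
film_rate = predict_rate("Film", 120, "2010s")
series_rate = predict_rate("Series", 45, "2010s")
print(round(film_rate, 3), round(series_rate, 3))
```

For the film this is 5.99 + 0.0136 × 120 - 0.945, and for the series it is 5.99 + 1.91 + 0.0136 × 45 - 0.945.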
This model could be used to predict Rate, IMDb rating, for a production using the variables Type, Duration, and Air_Date, although its adjusted R-squared of 0.275 means that most of the variability in Rate remains unexplained.
In this sub-section, we will try to explore if there is a correlation between the IMDb rating of a film or series and the level of nudity in its content. This question could help any producers figure out how nudity could potentially affect the IMDb rating of their production.
First, since the level of nudity was not rated for some productions, we need to filter those out. We can now calculate the overall median Rate in order to compare it with the median Rate at each level of nudity.
## # A tibble: 1 × 1
## median
## <dbl>
## 1 7.1
Now, to get a basic idea of the ratings at each level of nudity, we can make a box plot. We can compare the median IMDb rating at each level with the overall median by drawing a line at the overall median IMDb rating in the data set and observing how far off the median is for each level.
By simply looking at the box plot, it seems like there is no obvious correlation between the level of nudity of a film or series and its IMDb rating. However, while the median of the lower two levels (None and Mild) seems to be either the same or extremely close to the general median, the median rate of the higher levels (Moderate, Severe) of nudity is in fact slightly higher. This might suggest that audiences would prefer more nudity in a film or series.
To examine this more precisely, we will combine the levels “None” and “Mild” into a category called “Low”, and the levels “Moderate” and “Severe” into a category called “High”. We can then calculate the median IMDb rating for these two categories.
## # A tibble: 2 × 2
## nudity_binary median
## <chr> <dbl>
## 1 High 7.1
## 2 Low 7
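The recoding into “Low” and “High” can be sketched in Python with a lookup table. The rows below are invented, and the names `NUDITY_TO_BINARY` and `median_rate_by_nudity` are hypothetical. Rows whose level is not in the table, such as “No Rate”, are skipped, matching the filtering described at the start of this sub-section.

```python
from collections import defaultdict
from statistics import median

# Mapping from the four rated levels to the two combined categories.
NUDITY_TO_BINARY = {"None": "Low", "Mild": "Low", "Moderate": "High", "Severe": "High"}


def median_rate_by_nudity(rows):
    """Median Rate for the combined Low and High nudity categories."""
    rates = defaultdict(list)
    for row in rows:
        category = NUDITY_TO_BINARY.get(row["Nudity"])
        if category is None:
            continue  # unrated level, not part of this comparison
        rates[category].append(row["Rate"])
    return {category: median(values) for category, values in rates.items()}


# Invented rows, used only to show the recoding.
rows = [
    {"Nudity": "None", "Rate": 7.0},
    {"Nudity": "Mild", "Rate": 6.8},
    {"Nudity": "Moderate", "Rate": 7.4},
    {"Nudity": "Severe", "Rate": 7.0},
    {"Nudity": "No Rate", "Rate": 5.0},
]
nudity_medians = median_rate_by_nudity(rows)
print(nudity_medians)
```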
There still doesn’t seem to be much difference between the two medians (7.1 and 7). To further figure out if there is any relationship, we can calculate a 99% confidence interval for the difference in mean IMDb rating between productions with a “High” nudity level and those with a “Low” nudity level, and visualize the distribution of that difference.
## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 -0.0565 0.115
We can say with 99% confidence that the true difference in mean IMDb rating between productions with a “High” nudity level and those with a “Low” nudity level is between -0.0565 and 0.115. Since this confidence interval includes 0, it does not provide evidence that productions with a “High” level of nudity have higher ratings than those with a “Low” level of nudity. To analyze further, we will test our hypothesis with the null distribution below.
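One common way to obtain such an interval is a percentile bootstrap, sketched below in Python. This assumes the interval above was built that way, which this sub-section does not state. The ratings are invented, and the function name, number of resamples, and seed are hypothetical choices.

```python
import random
from statistics import mean


def bootstrap_diff_ci(group_a, group_b, level=0.99, reps=2000, seed=1):
    """Percentile bootstrap interval for mean(group_a) - mean(group_b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        # Resample each group with replacement, keeping its original size.
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        diffs.append(mean(resample_a) - mean(resample_b))
    diffs.sort()
    tail = (1 - level) / 2
    lower = diffs[int(tail * reps)]
    upper = diffs[int((1 - tail) * reps) - 1]
    return lower, upper


# Invented ratings for the two nudity categories.
high = [7.2, 6.8, 7.5, 6.9, 7.1, 7.4]
low = [7.0, 6.7, 7.3, 6.6, 7.2, 6.9]
lower_ci, upper_ci = bootstrap_diff_ci(high, low)
print(lower_ci, upper_ci)
```

If the resulting interval contains 0, the data are consistent with no difference in mean rating between the two groups.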
Our alternative hypothesis is that audiences favor productions with a “High” level of nudity over those with a “Low” level of nudity. Our null hypothesis is that audiences are indifferent to whether the production contains a “High” level of nudity or a “Low” level of nudity.
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0.248
Since the p-value, 0.248, is greater than the conventional threshold of 0.05, we don’t have enough evidence to reject the null hypothesis that audiences are indifferent to whether the production contains a “High” level of nudity or a “Low” level of nudity. Therefore, we do not find evidence of a relationship between the level of nudity in a production and its IMDb rating.
According to filmsite.org, the most common category among Best Picture nominees and winners has been Drama. This leads us to ask: while dramas may have the most critical acclaim, does this translate into widespread mainstream appeal, as indicated by IMDb rating?
First, we filtered out all of the movies and TV shows that had no rating, as they provide no information for answering our question. We also removed all of the duplicate entries in our dataset. Then we reduced our data frame to only name, rate, and genre, and recoded genre as a binary variable with the levels Drama and Non-Drama.
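These preparation steps can be sketched in Python as below. The sketch rests on assumptions that the text does not confirm: that Genre is a comma-separated string, that a production counts as a drama when “Drama” is one of its listed genres, that a missing rating is stored as `None`, and that duplicates are rows with the same name, rate, and genre. The rows and helper names are hypothetical.

```python
def genre_binary(genre):
    """Assumed rule: a production is a Drama if 'Drama' is among its listed genres."""
    labels = [label.strip() for label in genre.split(",")]
    return "Drama" if "Drama" in labels else "Non-Drama"


def prepare(rows):
    """Drop unrated rows and duplicates, keep three fields, and recode Genre."""
    seen = set()
    prepared = []
    for row in rows:
        if row["Rate"] is None:
            continue  # no rating, so no information for this question
        key = (row["Name"], row["Rate"], row["Genre"])
        if key in seen:
            continue  # duplicate entry
        seen.add(key)
        prepared.append(
            {"Name": row["Name"], "Rate": row["Rate"], "Genre": genre_binary(row["Genre"])}
        )
    return prepared


# Invented rows, used only to show the steps.
rows = [
    {"Name": "A", "Rate": 7.5, "Genre": "Crime, Drama", "Votes": 1200},
    {"Name": "A", "Rate": 7.5, "Genre": "Crime, Drama", "Votes": 1200},
    {"Name": "B", "Rate": None, "Genre": "Comedy", "Votes": 10},
    {"Name": "C", "Rate": 6.1, "Genre": "Action, Comedy", "Votes": 800},
]
prepared = prepare(rows)
print(prepared)
```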
Next, we wanted to visualize the distribution of IMDb scores for dramas versus non-dramas as seen below.
As can be seen in the histograms of rating counts, the distributions for both dramas and non-dramas appear to be unimodal and left-skewed. Dramas appear to have higher ratings on average than non-dramas, as the peak of the drama distribution sits at a higher rating than that of the non-dramas.
As can be seen in the box plot of IMDb rating for dramas versus non-dramas, dramas do have a higher median rating than non-dramas. The spread of the data appears to be about the same for both groups. We can further calculate some statistics for both genres.
## # A tibble: 2 × 4
## Genre median mean count
## <chr> <dbl> <dbl> <int>
## 1 Drama 7.3 7.17 2593
## 2 Non-Drama 6.7 6.66 2282
Both the median and mean IMDb rating are higher for shows and movies in the Drama genre than for those that are not dramas. This leads us to think that dramas may in fact have higher ratings on average, but we cannot say anything with confidence until we run some tests.
The first test we will run is a hypothesis test. Our null hypothesis is that IMDb rating is independent of whether a production is a drama or a non-drama. Our alternative hypothesis is that dramas have a higher IMDb rating than other genres. The test will use permutation, with the difference in median rating between dramas and non-dramas as the test statistic. We will use a significance level of 0.01.
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0
Since our p-value is so small that it rounds to 0, it is below the significance level, and we can reject the null hypothesis in favor of our alternative: the observed difference falls far in the tail of the null distribution. To complement the test, we will also calculate a confidence interval.
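A one-sided permutation test on the difference in medians can be sketched in Python as below. It illustrates the technique on invented ratings and is not the code that produced the p-value above. The function name, number of permutations, and seed are hypothetical choices.

```python
import random
from statistics import median


def permutation_p_value(group_a, group_b, reps=2000, seed=1):
    """One-sided p-value for median(group_a) - median(group_b) under shuffled labels."""
    rng = random.Random(seed)
    observed = median(group_a) - median(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_large = 0
    for _ in range(reps):
        rng.shuffle(pooled)  # break any link between rating and group label
        diff = median(pooled[:n_a]) - median(pooled[n_a:])
        if diff >= observed:
            at_least_as_large += 1
    return at_least_as_large / reps


# Invented ratings in which dramas are clearly rated higher.
dramas = [7.5, 7.8, 8.0, 7.9, 8.2, 7.7]
non_dramas = [6.0, 6.3, 6.1, 6.5, 5.9, 6.2]
p_value = permutation_p_value(dramas, non_dramas)
print(p_value)
```

A p-value reported as 0 from such a simulation means that no shuffled difference was as large as the observed one, so the true p-value is small rather than exactly zero.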
Below is a simulation-based bootstrap distribution of the difference in mean IMDb rating between dramas and non-dramas.
## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 0.435 0.591
We used a 99% confidence interval, so we can say with 99% confidence that the true difference in mean IMDb rating between dramas and non-dramas is between 0.435 and 0.591. Since 0 is not included in the interval, we are 99% confident that the mean IMDb rating for dramas is higher than that for non-dramas.
With all of this evidence, we can say with confidence that shows and movies that are dramas have, on average, a higher IMDb rating than non-dramas.
In the model we fit, we found how strongly certain variables were associated with IMDb rating. Type has the largest coefficient: when the production is a series, we expect the rating to be 1.91 higher, using film as the baseline. Duration also has a positive relationship with Rate, with a slope of 0.0136 per minute. Anybody thinking about creating a production might find it interesting that series tend to have higher IMDb ratings than films and that, holding the other variables constant, longer productions tend to have higher ratings. They could use these observations to create a series that is longer than usual, potentially earning a high IMDb rating. IMDb ratings also seem to decrease by time period, something people can’t necessarily control. Still, some could use this information to create a production with a theme that was common prior to the 1960s, since it might be more popular and generate a higher IMDb rating.
Furthermore, we notice that productions with more nudity seem to have a slightly higher IMDb rating in our data. However, based on our research, we have not found enough evidence that the amount of nudity in a production affects its IMDb rating. This suggests that the difference in median rating between productions with more nudity and productions with less nudity in our data might be due to chance, and that audiences may not rate films or series based on the amount of nudity in the content. With this conclusion, producers of films or series can add or remove nudity according to personal preference, as we were not able to conclude that nudity has an impact on IMDb rating.
Finally, through our research we found that shows and movies in the drama genre tend to have a higher rating than other genres, as indicated by IMDb rating. This could be due to cultural norms as to what is considered a “good” movie. Dramas tend to have room for actors and actresses to express themselves and show off their acting skills. They also tend to be about more serious topics and have more societal commentary than other genres such as comedies. As said before, dramas do tend to win Best Picture at the Academy Awards more than other genres, which now looks like more than just a coincidence.
The analysis conducted in this research could have been improved in several ways. One would be to test and compare more variables against each other, as some variables may be correlated with one another in ways that affect IMDb rating. For the model, we could also have tested more variables than those four, but this was difficult because the variables in the data set weren’t easy to work with and the levels were very generalized for variables like Violence and Frightening. One problem with the coefficients in the model is that films are expected to have a longer duration than series. This could account for why the TypeSeries coefficient was so high, meaning we expected series to have a much greater IMDb rating than films. Because the duration of a film is much longer, the Duration coefficient is multiplied by a greater number for films than for series in the model’s equation, increasing the predicted IMDb rating for films.
The data set wasn’t perfect either. One limitation is that the data isn’t completely up to date: some of the IMDb ratings have changed over time, which could have changed our results. Some variables also lack precisely defined levels. For example, Violence uses general labels such as “None” or “Moderate” to describe how much violence is in a production. These aren’t exact levels, so they may be subjective and potentially inaccurate.
Hopefully these results can help any producers decide on what they could potentially include in their productions in order to increase their IMDb rating.