
Movies Data Set Findings

We set out to find which quantitative factors best predict a movie’s voter rating on the Internet Movie Database (IMDb). To do this we used the movies data set in ggplot2, which contains 58,788 movies. The data set has a great deal of missing data, and we found different ways to work around this issue. The following are the variables from the data set that our regression models focused on:

Y (Predicted Variable): IMDb Voter Rating- the 1-10 rating assigned by IMDb users. We examined it both unweighted and with a weighting system we created.

X (Predictor Variables): Length- the length of each movie in minutes. Budget- the budget of each movie in USD. MPAA Rating- the Motion Picture Association of America assigns ratings to movies based on the appropriateness of their content. The ratings that appear in our data set are PG, PG-13, R, and NC-17. Genre- genre consists of seven different binary variables, and each movie can be assigned more than one genre.

Other Important Variables: Votes- the number of times a movie was assigned a rating on IMDb. Year Released- the year each movie was released in theaters.

Our first challenge with this data set was to deal with the low median number of votes. The median number of votes a movie received was 30, while the mean was 632, so the distribution of votes is extremely right-skewed. The boxplot, as you can see below, looks like an upside-down T:

The problem this created was that the majority of the entries in the data set (87%) did not have an appropriate sample size. We determined that an adequate sample size, when the population is all IMDb users, is 384 votes. To account for this we created a weighting system that applies a graduated penalty (a multiplier below 1) to the ratings of all films with fewer than 384 votes.

##   voterange weight
## 1   300-384   0.95
## 2   200-299   0.90
## 3   100-199   0.85
## 4      0-99   0.80
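
The report does not say how the 384-vote threshold was derived. One common route to that figure, assumed here rather than taken from the authors, is Cochran's sample-size formula for a proportion in a very large population, with 95% confidence, a 5% margin of error, and the most conservative proportion of 0.5. The variable names below are hypothetical.

```r
# Cochran's sample-size formula for a large population.
# This is an assumption about where 384 comes from; the report
# does not state the method that was actually used.
z <- qnorm(0.975)   # two-sided 95% confidence, about 1.96
p <- 0.5            # most conservative assumed proportion
e <- 0.05           # 5% margin of error

n_required <- z^2 * p * (1 - p) / e^2
n_required          # roughly 384.1
```

Under those assumptions the formula gives a value just above 384, which is consistent with the threshold used in the weighting table.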

The following is the code that we used to create our weighted IMDb ratings as a new column in that data set:

for(i in 1:nrow(movies)) if (movies[i,6] >= 1 & movies[i,6] <= 99) {(movies[i,25] = movies[i,5]*0.8)}

for(i in 1:nrow(movies)) if (movies[i,6] >= 100 & movies[i,6] <= 199) {(movies[i,25] = movies[i,5]*0.85)}

for(i in 1:nrow(movies)) if (movies[i,6] >= 200 & movies[i,6] <= 299) {(movies[i,25] = movies[i,5]*0.90)}

for(i in 1:nrow(movies)) if (movies[i,6] >= 300 & movies[i,6] <= 383) {(movies[i,25] = movies[i,5]*0.95)}

for(i in 1:nrow(movies)) if (movies[i,6] >= 384) {(movies[i,25] = movies[i,5]*1)}

names(movies)[names(movies)=="V25"] <- "WeightedRating"
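
The same weighting can be written without row-by-row loops. The following is a minimal sketch on a toy data frame, not the authors' code: `toy`, `breaks`, `weights`, and `bin` are hypothetical names, and the cut-off for full weight is placed at 384 votes to match the threshold stated in the text.

```r
# Toy stand-in for the two columns the loops read (rating and votes).
toy <- data.frame(
  rating = c(8.0, 6.0, 7.0, 5.0, 9.0, 4.0),
  votes  = c(5, 150, 250, 383, 384, 20000)
)

# Left-closed bins: [0,100), [100,200), [200,300), [300,384), [384,Inf)
breaks  <- c(0, 100, 200, 300, 384, Inf)
weights <- c(0.80, 0.85, 0.90, 0.95, 1.00)

# cut() returns the bin index of each vote count; use it to look up the weight.
bin <- cut(toy$votes, breaks = breaks, right = FALSE, labels = FALSE)
toy$WeightedRating <- toy$rating * weights[bin]
toy
```

Because `cut()` assigns every row to exactly one bin, no vote count can fall into two ranges, and the new column gets its final name directly instead of being renamed afterwards.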

Single Regression Analysis

We ran several single-variable linear regression models to compare budget and length of movies as predictors of IMDb ratings.

Budget

We broke budget down into a few categories. The first is the raw data. The only manipulation we made was a simple removal of any data that had “NA” values for budget. We then compared those findings to a data set of movies with more than 10,000 votes. Within both the “raw data” and the data with 10,000 votes, we compared the default rating that was given by the data with the weighted rating we created. Below are our results.
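
The data preparation and model fitting code for this step is not shown in the report. The following is a minimal sketch of how it might look, run on simulated data so that it is self-contained. The data frame `toy` and the names `budget_data`, `high_vote`, `fit_full`, and `fit_high` are hypothetical, and the simulated values have no connection to the real results.

```r
set.seed(1)

# Simulated stand-in for the movies data (values are made up for illustration).
n <- 200
toy <- data.frame(
  budget = sample(c(NA, 1e6, 5e6, 2e7, 8e7), n, replace = TRUE),
  votes  = sample(c(30, 500, 12000, 50000), n, replace = TRUE),
  rating = round(runif(n, 1, 10), 1)
)

# Drop rows with a missing budget, then keep only the heavily voted movies.
budget_data <- subset(toy, !is.na(budget))
high_vote   <- subset(budget_data, votes > 10000)

# Single-variable linear regression of rating on budget for each subset.
fit_full <- lm(rating ~ budget, data = budget_data)
fit_high <- lm(rating ~ budget, data = high_vote)

# summary() reports both the multiple and the adjusted R-squared.
summary(fit_full)$r.squared
summary(fit_full)$adj.r.squared
```

The same pattern applies to the weighted models by swapping `rating` for the weighted rating column in the formula.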

Budget for Full Data

Below are the results of the full data sets comparing the default rating vs. the weighted rating we created. First are the graphs followed by the summary statistics of the single linear regression models.

Default Rating Budget Summary Statistics

## 
## Call:
## lm(formula = rating ~ budget, data = budget)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1533 -0.9518  0.1495  1.0558  3.8467 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.153e+00  2.471e-02 249.067   <2e-16 ***
## budget      -9.427e-10  9.175e-10  -1.027    0.304    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.547 on 5213 degrees of freedom
## Multiple R-squared:  0.0002025,  Adjusted R-squared:  1.068e-05 
## F-statistic: 1.056 on 1 and 5213 DF,  p-value: 0.3043

Weighted Rating Budget Summary Statistics

## 
## Call:
## lm(formula = WeightedRating ~ budget, data = wbudget)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7673 -0.9935  0.1190  1.1550  3.4799 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.565e+00  2.419e-02  230.06   <2e-16 ***
## budget      9.183e-09  8.984e-10   10.22   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.515 on 5213 degrees of freedom
## Multiple R-squared:  0.01965,    Adjusted R-squared:  0.01946 
## F-statistic: 104.5 on 1 and 5213 DF,  p-value: < 2.2e-16

Analysis: There are a few interesting things worth noting from the analysis above. First, comparing the graphs, our weighted rating changes the relationship of budget to rating from slightly negative to clearly positive. However, the graphs make it obvious that the relationship is not linear. There seems to be a point beyond which budget is no longer a strong indicator of movie quality, taking the IMDb rating as the measure of quality.

The summary statistics show some interesting results. With the default rating, budget is not a significant predictor (p-value of 0.304). In the model using the weighted rating we created, however, budget becomes a statistically significant predictor. The adjusted r-squared goes from 1.068e-05 with the default rating to 0.01946 with the weighted rating. Although the predictor is now significant, budget accounts for only about 2% of the variation in the response variable, rating.

Budget for Data with More than 10,000 Votes

Below are the results for the movies that have more than 10,000 votes, again comparing the default rating with the weighted rating we created. First are the graphs, followed by the summary statistics of the single linear regression models.

Default Rating Budget Summary Statistics (>10,000 Votes)

## 
## Call:
## lm(formula = rating ~ budget, data = highvote)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6368 -0.5779  0.0847  0.6069  2.6577 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.564e+00  4.942e-02  153.04   <2e-16 ***
## budget      -1.161e-08  9.096e-10  -12.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8973 on 734 degrees of freedom
## Multiple R-squared:  0.1816, Adjusted R-squared:  0.1804 
## F-statistic: 162.8 on 1 and 734 DF,  p-value: < 2.2e-16

Weighted Rating Budget Summary Statistics (>10,000 Votes)

## 
## Call:
## lm(formula = WeightedRating ~ budget, data = whighvote)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6368 -0.5779  0.0847  0.6069  2.6577 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.564e+00  4.942e-02  153.04   <2e-16 ***
## budget      -1.161e-08  9.096e-10  -12.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8973 on 734 degrees of freedom
## Multiple R-squared:  0.1816, Adjusted R-squared:  0.1804 
## F-statistic: 162.8 on 1 and 734 DF,  p-value: < 2.2e-16

Analysis: This is quite interesting, especially in relation to the analysis above. The first thing to note is that the models for the default and the weighted rating are identical. This makes sense: the weighted rating only penalizes movies with fewer than 384 votes, and every movie in this subset has more than 10,000 votes, so its weight is 1.

The next thing to note is that the relationship between budget and rating is now negative. The graphs show that simply spending a large amount of money on a movie does not ensure a high IMDb rating.

The summary statistics (again, identical) tell us that this model has an adjusted r-squared value of 0.1804. This is the strongest r-squared we got from all the single regression analyses we ran, meaning that for movies with more than 10,000 votes on IMDb, budget accounts for about 18% of the variation in the IMDb rating. We concluded that producers should be careful when spending money on their films and not simply assume that higher budgets will result in more successful films.

Length of Movie

We did a similar exercise with the length of movie as with the budget above. For this data set, we again removed any observations with a length value of “NA.” We also restricted the data to movies longer than 50 minutes and shorter than 330 minutes. We chose the lower boundary so that short films would not be included in this analysis: we treated anything 50 minutes or shorter as a short film, in the spirit of the running-time cutoff the Academy Awards use to define short films. We chose 330 as an upper boundary because a few extreme outliers would otherwise throw off the analysis.
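
The length filter described above can be sketched as follows. This is an illustration on a toy data frame, not the authors' code; `toy` and `feature_length` are hypothetical names, and both boundaries are treated as exclusive, as the text describes.

```r
# Toy data with a short film, boundary values, an extreme outlier, and an NA.
toy <- data.frame(
  length = c(7, 50, 51, 120, 329, 330, 5220, NA),
  rating = c(6.1, 5.0, 7.2, 6.8, 5.5, 4.9, 3.0, 6.0)
)

# Keep rows with a known length strictly between 50 and 330 minutes.
feature_length <- subset(toy, !is.na(length) & length > 50 & length < 330)
feature_length$length
```

Only the 51, 120, and 329 minute entries survive; the rows at exactly 50 and 330 minutes are dropped along with the missing value and the outlier.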

As with the budget above, we compared our “full” data set and our data set of movies with more than 10,000 votes. Within each we again compared the default rating with our weighted rating. Below are our results.

Length for Full Data

Default Rating Length Summary Statistics

## 
## Call:
## lm(formula = rating ~ length, data = length)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5604 -0.9171  0.1397  1.0262  4.5775 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.5332382  0.0309502  146.47   <2e-16 ***
## length      0.0135141  0.0003164   42.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.5 on 48976 degrees of freedom
## Multiple R-squared:  0.03592,    Adjusted R-squared:  0.0359 
## F-statistic:  1825 on 1 and 48976 DF,  p-value: < 2.2e-16

Weighted Rating Length Summary Statistics

## 
## Call:
## lm(formula = WeightedRating ~ length, data = wlength)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3947 -0.8815  0.0550  0.9183  4.2684 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.332967   0.028171  118.31   <2e-16 ***
## length      0.016651   0.000288   57.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.365 on 48976 degrees of freedom
## Multiple R-squared:  0.06391,    Adjusted R-squared:  0.06389 
## F-statistic:  3343 on 1 and 48976 DF,  p-value: < 2.2e-16

Analysis: Looking at length of movie as an indicator of movie ratings gave us a few interesting insights, but not many. First, note that the linear relationship is positive for both the default and the weighted rating. The graph shows, however, that the relationship is most likely not linear. In the graph there is a small region, between approximately 140 and 190 minutes, with a higher concentration of movies with relatively high ratings. We gather from this that a length within that range appears to be a relatively safe bet when trying to make a movie with a high IMDb rating.

Length for Data with More than 10,000 Votes

Default Rating Length Summary Statistics (>10,000)

## 
## Call:
## lm(formula = rating ~ length, data = highlen)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8342 -0.5826  0.1180  0.6682  1.8413 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.016638   0.157312  38.247  < 2e-16 ***
## length      0.009236   0.001286   7.181 1.53e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9448 on 837 degrees of freedom
## Multiple R-squared:  0.05803,    Adjusted R-squared:  0.0569 
## F-statistic: 51.56 on 1 and 837 DF,  p-value: 1.533e-12

Weighted Rating Length Summary Statistics (>10,000)

## 
## Call:
## lm(formula = WeightedRating ~ length, data = whighlen)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8342 -0.5826  0.1180  0.6682  1.8413 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.016638   0.157312  38.247  < 2e-16 ***
## length      0.009236   0.001286   7.181 1.53e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9448 on 837 degrees of freedom
## Multiple R-squared:  0.05803,    Adjusted R-squared:  0.0569 
## F-statistic: 51.56 on 1 and 837 DF,  p-value: 1.533e-12

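Analysis: Restricting the data to movies with more than 10,000 votes did not change the analysis much. The relationship between length and rating remains positive and statistically significant, and the default and weighted models are again identical because every movie in this subset carries a weight of 1.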

Genre

There are seven unique genres to which a movie can be assigned in this data set. Genre is represented as a binary number, assigning each movie a “1” if it belongs to that genre, and a “0” if it does not belong to that genre. A movie can belong to multiple genres, and a movie can belong to no genre at all.

To begin, the weighting system was applied to “movies” to create a data set for analyzing the genre.

library(ggplot2)

#This chunk applies the weights to the movies data

for(i in 1:nrow(movies)) if (movies[i,6] >= 1 & movies[i,6] <= 99) {(movies[i,25] = movies[i,5]*0.8)}

for(i in 1:nrow(movies)) if (movies[i,6] >= 100 & movies[i,6] <= 199) {(movies[i,25] = movies[i,5]*0.85)}

for(i in 1:nrow(movies)) if (movies[i,6] >= 200 & movies[i,6] <= 299) {(movies[i,25] = movies[i,5]*0.90)}

for(i in 1:nrow(movies)) if (movies[i,6] >= 300 & movies[i,6] <= 383) {(movies[i,25] = movies[i,5]*0.95)}

for(i in 1:nrow(movies)) if (movies[i,6] >= 384) {(movies[i,25] = movies[i,5]*1)}

names(movies)[names(movies)=="V25"] <- "WeightedRating"

Next, all of the movies with no genre assigned were removed from the database. There are 12,786 movies in the original database with no genre assigned.

#This chunk removes all of the movies in the database without a genre assigned

movgenre = subset(movies, Action == 1 | Animation == 1 | Comedy == 1 | Drama == 1 | Documentary == 1 | Romance == 1 | Short == 1)

The following code creates seven separate data sets, each of which represents every movie with a specific genre assigned. For example, every movie in the “Action” subset has a “1” assigned for the action genre.

#This chunk creates separate datasets for each genre

Action = subset(movgenre, Action == 1)
Animation = subset(movgenre, Animation == 1)
Comedy = subset(movgenre, Comedy == 1)
Drama = subset(movgenre, Drama == 1)
Documentary = subset(movgenre, Documentary == 1)
Romance = subset(movgenre, Romance == 1)
Short = subset(movgenre, Short == 1)

The first analysis carried out was to determine what percentage of the movies with at least one genre assigned falls into each genre. The following code calculates these percentages by dividing the number of rows in each genre data set by the number of rows in movgenre.

#This chunk determines the percentage of movies in the data set that are assigned to each genre

Actionpercent = round(nrow(Action)/nrow(movgenre)*100, digits = 2)
Animationpercent = round(nrow(Animation)/nrow(movgenre)*100, digits = 2)
Comedypercent = round(nrow(Comedy)/nrow(movgenre)*100, digits = 2)
Dramapercent = round(nrow(Drama)/nrow(movgenre)*100, digits = 2)
Documentarypercent = round(nrow(Documentary)/nrow(movgenre)*100, digits = 2)
Romancepercent = round(nrow(Romance)/nrow(movgenre)*100, digits = 2)
Shortpercent = round(nrow(Short)/nrow(movgenre)*100, digits = 2)
## 10.19 % of the movies fell into the Action genre
## 8.02 % of the movies fell into the Animation genre
## 37.54 % of the movies fell into the Comedy genre
## 47.41 % of the movies fell into the Drama genre
## 7.55 % of the movies fell into the Documentary genre
## 10.31 % of the movies fell into the Romance genre
## 20.56 % of the movies fell into the Short genre

As can be seen in the data above, the genre that represents the largest percentage of these movies is Drama (47.41%), and the genre that represents the smallest percentage is Documentary (7.55%). It is important to note that the sum of all of these percentages is greater than 100%. This is because movies can belong to multiple genres at once. Further analysis investigates these relationships between genres.
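
Because each genre is a 0/1 column, the share of movies in a genre is simply the column mean. The sketch below illustrates this on a toy set of indicators, along with the reason the shares can exceed 100 in total. The data frame `genres` and the name `genre_pct` are hypothetical and the values are made up.

```r
# Toy genre indicators: one row per movie, 1 = movie belongs to the genre.
genres <- data.frame(
  Action = c(1, 0, 0, 1, 0),
  Comedy = c(1, 1, 0, 0, 1),
  Drama  = c(0, 1, 1, 1, 0)
)

# The mean of a 0/1 column is the share of movies in that genre.
genre_pct <- round(colMeans(genres) * 100, digits = 2)
genre_pct

# Movies with several genres are counted once per genre,
# so the shares can add up to more than 100.
sum(genre_pct)
```

In this toy example three of the five movies carry two genres each, which is what pushes the total above 100.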

The following code calculates, for each genre, the percentage of its movies that also belong to each of the other genres. This is done by dividing the number of movies within one genre subset that also carry a given second genre by the number of rows in that subset.

#This chunk determines the percentage of genres that are contained within other genres

ActionAction = round(nrow(subset(Action, Action == 1))/nrow(Action)*100, digits = 2)
ActionAnimation = round(nrow(subset(Action, Animation == 1))/nrow(Action)*100, digits = 2)
ActionComedy = round(nrow(subset(Action, Comedy == 1))/nrow(Action)*100, digits = 2)
ActionDrama = round(nrow(subset(Action, Drama == 1))/nrow(Action)*100, digits = 2)
ActionDocumentary = round(nrow(subset(Action, Documentary == 1))/nrow(Action)*100, digits = 2)
ActionRomance = round(nrow(subset(Action, Romance == 1))/nrow(Action)*100, digits = 2)
ActionShort = round(nrow(subset(Action, Short == 1))/nrow(Action)*100, digits = 2)

AnimationAction = round(nrow(subset(Animation, Action == 1))/nrow(Animation)*100, digits = 2)
AnimationAnimation = round(nrow(subset(Animation, Animation == 1))/nrow(Animation)*100, digits = 2)
AnimationComedy = round(nrow(subset(Animation, Comedy == 1))/nrow(Animation)*100, digits = 2)
AnimationDrama = round(nrow(subset(Animation, Drama == 1))/nrow(Animation)*100, digits = 2)
AnimationDocumentary = round(nrow(subset(Animation, Documentary == 1))/nrow(Animation)*100, digits = 2)
AnimationRomance = round(nrow(subset(Animation, Romance == 1))/nrow(Animation)*100, digits = 2)
AnimationShort = round(nrow(subset(Animation, Short == 1))/nrow(Animation)*100, digits = 2)

ComedyAction = round(nrow(subset(Comedy, Action == 1))/nrow(Comedy)*100, digits = 2)
ComedyAnimation = round(nrow(subset(Comedy, Animation == 1))/nrow(Comedy)*100, digits = 2)
ComedyComedy = round(nrow(subset(Comedy, Comedy == 1))/nrow(Comedy)*100, digits = 2)
ComedyDrama = round(nrow(subset(Comedy, Drama == 1))/nrow(Comedy)*100, digits = 2)
ComedyDocumentary = round(nrow(subset(Comedy, Documentary == 1))/nrow(Comedy)*100, digits = 2)
ComedyRomance = round(nrow(subset(Comedy, Romance == 1))/nrow(Comedy)*100, digits = 2)
ComedyShort = round(nrow(subset(Comedy, Short == 1))/nrow(Comedy)*100, digits = 2)

DramaAction = round(nrow(subset(Drama, Action == 1))/nrow(Drama)*100, digits = 2)
DramaAnimation = round(nrow(subset(Drama, Animation == 1))/nrow(Drama)*100, digits = 2)
DramaComedy = round(nrow(subset(Drama, Comedy == 1))/nrow(Drama)*100, digits = 2)
DramaDrama = round(nrow(subset(Drama, Drama == 1))/nrow(Drama)*100, digits = 2)
DramaDocumentary = round(nrow(subset(Drama, Documentary == 1))/nrow(Drama)*100, digits = 2)
DramaRomance = round(nrow(subset(Drama, Romance == 1))/nrow(Drama)*100, digits = 2)
DramaShort = round(nrow(subset(Drama, Short == 1))/nrow(Drama)*100, digits = 2)

DocumentaryAction = round(nrow(subset(Documentary, Action == 1))/nrow(Documentary)*100, digits = 2)
DocumentaryAnimation = round(nrow(subset(Documentary, Animation == 1))/nrow(Documentary)*100, digits = 2)
DocumentaryComedy = round(nrow(subset(Documentary, Comedy == 1))/nrow(Documentary)*100, digits = 2)
DocumentaryDrama = round(nrow(subset(Documentary, Drama == 1))/nrow(Documentary)*100, digits = 2)
DocumentaryDocumentary = round(nrow(subset(Documentary, Documentary == 1))/nrow(Documentary)*100, digits = 2)
DocumentaryRomance = round(nrow(subset(Documentary, Romance == 1))/nrow(Documentary)*100, digits = 2)
DocumentaryShort = round(nrow(subset(Documentary, Short == 1))/nrow(Documentary)*100, digits = 2)

RomanceAction = round(nrow(subset(Romance, Action == 1))/nrow(Romance)*100, digits = 2)
RomanceAnimation = round(nrow(subset(Romance, Animation == 1))/nrow(Romance)*100, digits = 2)
RomanceComedy = round(nrow(subset(Romance, Comedy == 1))/nrow(Romance)*100, digits = 2)
RomanceDrama = round(nrow(subset(Romance, Drama == 1))/nrow(Romance)*100, digits = 2)
RomanceDocumentary = round(nrow(subset(Romance, Documentary == 1))/nrow(Romance)*100, digits = 2)
RomanceRomance = round(nrow(subset(Romance, Romance == 1))/nrow(Romance)*100, digits = 2)
RomanceShort = round(nrow(subset(Romance, Short == 1))/nrow(Romance)*100, digits = 2)

ShortAction = round(nrow(subset(Short, Action == 1))/nrow(Short)*100, digits = 2)
ShortAnimation = round(nrow(subset(Short, Animation == 1))/nrow(Short)*100, digits = 2)
ShortComedy = round(nrow(subset(Short, Comedy == 1))/nrow(Short)*100, digits = 2)
ShortDrama = round(nrow(subset(Short, Drama == 1))/nrow(Short)*100, digits = 2)
ShortDocumentary = round(nrow(subset(Short, Documentary == 1))/nrow(Short)*100, digits = 2)
ShortRomance = round(nrow(subset(Short, Romance == 1))/nrow(Short)*100, digits = 2)
ShortShort = round(nrow(subset(Short, Short == 1))/nrow(Short)*100, digits = 2)
## 100 , 1.79 , 16.55 , 38.37 , 0.34 , 5.91 and 3.09 % of Action movies also fell into the genre of Action, Animation, Comedy, Drama, Documentary, Romance, and Short respectively.
## 2.28 , 100 , 61 , 3.66 , 1.17 , 1.08 and 84.44 % of Animated movies also fell into the genre of Action, Animation, Comedy, Drama, Documentary, Romance, and Short respectively.
## 4.49 , 13.03 , 100 , 17.94 , 0.76 , 12.71 and 22.47 % of Comedies also fell into the genre of Action, Animation, Comedy, Drama, Documentary, Romance, and Short respectively.
## 8.25 , 0.62 , 14.21 , 100 , 0.58 , 11.74 and 5.04 % of Dramas also fell into the genre of Action, Animation, Comedy, Drama, Documentary, Romance, and Short respectively.
## 0.46 , 1.24 , 3.77 , 3.66 , 100 , 0.29 and 24.97 % of Documentaries also fell into the genre of Action, Animation, Comedy, Drama, Documentary, Romance, and Short respectively.
## 5.84 , 0.84 , 46.27 , 53.98 , 0.21 , 100 and 3.54 % of Romances also fell into the genre of Action, Animation, Comedy, Drama, Documentary, Romance, and Short respectively.
## 1.53 , 32.95 , 41.02 , 11.62 , 9.17 , 1.78 and 100 % of Shorts also fell into the genre of Action, Animation, Comedy, Drama, Documentary, Romance, and Short respectively.

Several genre-within-genre relationships stand out in the findings above. First, 38.37% of action movies are also categorized as dramas, which makes sense because many action movies are also dramatic. Next, 61% of animated films are also comedies, which is logical because many animated films are intended to be comedic. In addition, 84.44% of animated films are also categorized in the “Short” genre, so most animated films in this data set are shorts; the reverse does not hold, since only 32.95% of shorts are animated. Finally, 46.27% of romance films are also comedies, and 53.98% of romance films are also dramas. The former likely reflects the popularity of romantic comedies, and the latter is probably because many romantic movies are dramatic in nature.

It is important to note that these percentages are not symmetric: the share of genre A movies that also belong to genre B is not necessarily equal to the share of genre B movies that also belong to genre A. For example, 16.55% of action movies are also comedies, whereas only 4.49% of comedies are also action movies. This is caused by the differing number of films in each genre subset. In this example, 776 movies are categorized as both action and comedy. There are only 4,688 movies in the action subset, however, whereas there are 17,271 movies in the comedy subset, and 776 is a different percentage of 4,688 than it is of 17,271. This is why the percentages differ depending on perspective, even though at first it may seem as though they should be equal.
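
The asymmetry can be checked directly from the counts quoted above, and the same calculation can be done for every genre pair at once with a co-occurrence matrix. The second half of this sketch uses a made-up 0/1 matrix; `genres`, `counts`, and `pct_within` are hypothetical names and not part of the original analysis.

```r
# Counts quoted in the text.
both     <- 776    # movies tagged as both Action and Comedy
n_action <- 4688   # movies in the Action subset
n_comedy <- 17271  # movies in the Comedy subset

round(both / n_action * 100, digits = 2)  # share of Action movies that are Comedies
round(both / n_comedy * 100, digits = 2)  # share of Comedies that are Action movies

# The same idea for every pair at once, on a toy 0/1 genre matrix.
genres <- cbind(
  Action = c(1, 1, 0, 0, 1, 0),
  Comedy = c(1, 0, 1, 1, 1, 1)
)
counts <- crossprod(genres)   # pairwise co-occurrence counts

# Dividing by the diagonal scales each row by its own genre size:
# row i holds the percentage of genre i movies that are also in genre j.
pct_within <- round(counts / diag(counts) * 100, digits = 2)
pct_within
```

The two divisions reproduce the 16.55% and 4.49% figures, and the toy matrix shows the same effect: the Action row and the Comedy row report different percentages for the same set of shared movies.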

The next genre-genre relationships that were examined were the correlations between each genre. The following code calculates these correlations.

#This chunk determines the correlation between each genre

corActionAction = round(cor(movgenre$Action, movgenre$Action), digits = 4)
corActionAnimation = round(cor(movgenre$Action, movgenre$Animation), digits = 4)
corActionComedy = round(cor(movgenre$Action, movgenre$Comedy), digits = 4)
corActionDrama = round(cor(movgenre$Action, movgenre$Drama), digits = 4)
corActionDocumentary = round(cor(movgenre$Action, movgenre$Documentary), digits = 4)
corActionRomance = round(cor(movgenre$Action, movgenre$Romance), digits = 4)
corActionShort = round(cor(movgenre$Action, movgenre$Short), digits = 4)

corAnimationAnimation = round(cor(movgenre$Animation, movgenre$Animation), digits = 4)
corAnimationComedy = round(cor(movgenre$Animation, movgenre$Comedy), digits = 4)
corAnimationDrama = round(cor(movgenre$Animation, movgenre$Drama), digits = 4)
corAnimationDocumentary = round(cor(movgenre$Animation, movgenre$Documentary), digits = 4)
corAnimationRomance = round(cor(movgenre$Animation, movgenre$Romance), digits = 4)
corAnimationShort = round(cor(movgenre$Animation, movgenre$Short), digits = 4)

corComedyComedy = round(cor(movgenre$Comedy, movgenre$Comedy), digits = 4)
corComedyDrama = round(cor(movgenre$Comedy, movgenre$Drama), digits = 4)
corComedyDocumentary = round(cor(movgenre$Comedy, movgenre$Documentary), digits = 4)
corComedyRomance = round(cor(movgenre$Comedy, movgenre$Romance), digits = 4)
corComedyShort = round(cor(movgenre$Comedy, movgenre$Short), digits = 4)

corDramaDrama = round(cor(movgenre$Drama, movgenre$Drama), digits = 4)
corDramaDocumentary = round(cor(movgenre$Drama, movgenre$Documentary), digits = 4)
corDramaRomance = round(cor(movgenre$Drama, movgenre$Romance), digits = 4)
corDramaShort = round(cor(movgenre$Drama, movgenre$Short), digits = 4)

corDocumentaryDocumentary = round(cor(movgenre$Documentary, movgenre$Documentary), digits = 4)
corDocumentaryRomance = round(cor(movgenre$Documentary, movgenre$Romance), digits = 4)
corDocumentaryShort = round(cor(movgenre$Documentary, movgenre$Short), digits = 4)

corRomanceRomance = round(cor(movgenre$Romance, movgenre$Romance), digits = 4)
corRomanceShort = round(cor(movgenre$Romance, movgenre$Short), digits = 4)

corShortShort = round(cor(movgenre$Short, movgenre$Short), digits = 4)
## The correlations between the genre of Action and Animation, Comedy, Drama, Documentary, Romance and Short are -0.0773 , -0.146 , -0.061 , -0.0919 , -0.0488 , -0.1456 respectively
## The correlations between the genre of Animation and Action, Comedy, Drama, Documentary, Romance and Short are -0.0773 , 0.1431 , -0.2588 , -0.0713 , -0.0896 , 0.4668 respectively
## The correlations between the genre of Comedy and Action, Animation, Drama, Documentary, Romance and Short are -0.146 , 0.1431 , -0.4576 , -0.1993 , 0.0611 , 0.0366 respectively
## The correlations between the genre of Drama and Action, Animation, Comedy, Documentary, Romance and Short are -0.061 , -0.2588 , -0.4576 , -0.2504 , 0.0446 , -0.3647 respectively
## The correlations between the genre of Documentary and Action, Animation, Comedy, Drama, Romance and Short are -0.0919 , -0.0713 , -0.1993 , -0.2504 , -0.0942 , 0.0312 respectively
## The correlations between the genre of Romance and Action, Animation, Comedy, Drama, Documentary and Short are -0.0488 , -0.0896 , 0.0611 , 0.0446 , -0.0942 , -0.1428 respectively
## The correlations between the genre of Short and Action, Animation, Comedy, Drama, Documentary and Romance are -0.1456 , 0.4668 , 0.0366 , -0.3647 , 0.0312 , -0.1428 respectively

There are a few interesting relationships to note from the findings above. First, the correlation between animation and drama is -0.2588. This is most likely because many animated movies are intended to be lighthearted, and many movies that are meant to be dramatic are not animated, which leaves few movies that are both animated and dramatic. In addition, the correlation between animation and short is 0.4668. This is the highest correlation found in the data and, as with the earlier findings for these two genres, is likely caused by the fact that most animated movies are shorts and about a third of shorts are animated. Next, the correlation between drama and documentary is -0.2504. This is possibly because many documentaries aim to represent a subject accurately and are not intended to dramatize the facts. Along with this, the correlation between drama and short is -0.3647, which makes sense because many dramatic movies tend to have a long running time. Finally, the most negative correlation found in the data is between comedy and drama, at -0.4576. This is most likely because most comedies tend not to be dramatic, and most dramas tend to steer away from comedy (although Shakespeare might disagree).
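
All of the pairwise correlations can also be obtained in one call by passing the genre columns to `cor()` together. The sketch below uses made-up indicators to show the shape of the result; `genres` and `genre_cor` are hypothetical names. For 0/1 variables, the Pearson correlation computed here is the same quantity as the phi coefficient.

```r
# Toy 0/1 genre indicators (made-up values, one row per movie).
genres <- data.frame(
  Animation = c(1, 1, 0, 0, 0, 1, 0, 0),
  Short     = c(1, 1, 1, 0, 0, 0, 0, 0),
  Drama     = c(0, 0, 0, 1, 1, 0, 1, 1)
)

# cor() on a data frame returns the full pairwise correlation matrix.
genre_cor <- round(cor(genres), digits = 4)
genre_cor
```

The matrix is symmetric with ones on the diagonal, so each pair only needs to be read once. Genres that never occur together in the toy data, such as Animation and Drama, come out negatively correlated.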

The last form of analysis that was performed on the genre data was to examine the relationship between each genre and the weighted and unweighted ratings all of its collective movies earn. The following code displays a box plot of the unweighted ratings for each genre.

#This chunk displays the boxplot of genre ratings for unweighted ratings

boxplot(Action$rating, Animation$rating, Comedy$rating, Drama$rating, Documentary$rating, Romance$rating, Short$rating, names = c("Action", "Animation", "Comedy", "Drama", "Documentary", "Romance", "Short"), main = "Ratings by Genre", ylab = "Rating")

Above is a boxplot displaying the unweighted ratings for each genre

From this plot, it can be seen that documentary has the highest typical (median) unweighted rating, and action has the lowest. In addition, action appears to have the most variation in unweighted ratings (the largest IQR and the fewest outliers), and animation appears to have the least variation in unweighted ratings.

The following code calculates the average overall unweighted rating over all seven genres.

## The mean unweighted rating for all genres is 6.18

As can be seen, the average overall unweighted rating is 6.18.

The following code creates another box plot that examines the relationship between each genre and weighted ratings.

#This chunk displays the boxplot of genre ratings for weighted ratings

boxplot(Action$WeightedRating, Animation$WeightedRating, Comedy$WeightedRating, Drama$WeightedRating, Documentary$WeightedRating, Romance$WeightedRating, Short$WeightedRating, names = c("Action", "Animation", "Comedy", "Drama", "Documentary", "Romance", "Short"), main = "Ratings by Genre", ylab = "Rating")

Above is a boxplot displaying the weighted ratings for each genre

This plot looks similar to the previous one, with a couple of differences. First, the variation between the genres is smaller; the typical ratings for each genre are closer together. Second, the average overall rating appears to be lower. The following code calculates this value.

## The mean weighted rating for all genres is 5.19

As can be seen, the average overall weighted rating is 5.19. This is almost exactly one point (0.99) lower than the average overall unweighted rating (6.18).

Release Year

We looked at using release year as a potential regression input, but after looking at a plot of year vs. WeightedRating, the relationship appeared very non-linear. One thing we did pick up on is that movies made before 1920 tend to have lower ratings, and that the variance widens as the rate of movie creation increases over the next eight decades. You can see this in the plot below:

plot(movies$year, movies$WeightedRating)

So, given all of our findings, how do you set yourself up to make a movie with a high IMDb rating? See the following table:

##       variable                                     OptimalDecision
## 1        Genre                                           Animation
## 2       Length                 Doesn't matter but aim for the mean
## 3  MPAA Rating                                               PG-13
## 4 Release Year Anytime after 1920, but this shouldn't be a problem
## 5       Budget                                Spend Conservatively