Movies

This is my first Markdown document! This report is on the “movies” dataset from the “ggplot2movies” package.

Let’s load the “movies” data and look at summaries and data structures (results omitted).

The raw data set contains 5215 observations of 25 variables. I noticed that:

1. The “Action”, “Animation”, … binary (dichotomous) columns can be grouped into a new category, “genre”, where the column names become the categorical values of the “genre” variable.
2. There are a lot of missing values in the data set.
3. Looking at the min/max value for each variable, “length” and “budget” seem to have extreme outliers at the maximum.

Let’s filter out the missing values and regroup the different genre columns into a new “genre” variable. When a movie has a “1” in exactly one binary column, that column’s name becomes its genre. A movie with more than one genre (a “1” in more than one binary column) is labeled “Mixed”, and a movie that does not belong to any genre is labeled “None”.

# Drop rows with any missing value, then express budget in millions of dollars
movies <- na.omit(movies)
budget_millions <- movies$budget / 1000000
# Columns 18:24 hold the seven binary genre indicators (Action ... Short)
genre <- rep(NA, nrow(movies))
count <- rowSums(movies[, 18:24])
genre[which(count > 1)] <- "Mixed"  # more than one genre flag is set
genre[which(count < 1)] <- "None"   # no genre flag is set
# Exactly one flag set: use that column's name as the genre
genre[which(count == 1 & movies$Action == 1)] <- "Action"
genre[which(count == 1 & movies$Animation == 1)] <- "Animation"
genre[which(count == 1 & movies$Comedy == 1)] <- "Comedy"
genre[which(count == 1 & movies$Drama == 1)] <- "Drama"
genre[which(count == 1 & movies$Documentary == 1)] <- "Documentary"
genre[which(count == 1 & movies$Romance == 1)] <- "Romance"
genre[which(count == 1 & movies$Short == 1)] <- "Short"
movies$genre <- as.factor(genre)
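The grouping rule above can also be written without one line per genre. Here is a minimal sketch on a made-up data frame (`toy`, `genre_cols`, `n_flags` and `single` are hypothetical names, and only three genre columns are used to keep it short). It selects the columns by name instead of by position, which I assume is a bit safer if the column order ever changes.

```r
# Toy stand-in for the binary genre columns (hypothetical data, not the real set).
toy <- data.frame(
  Action = c(1, 0, 1, 0),
  Comedy = c(0, 0, 1, 0),
  Drama  = c(0, 1, 0, 0)
)
genre_cols <- c("Action", "Comedy", "Drama")

# Number of genre flags set for each movie.
n_flags <- rowSums(toy[, genre_cols])

# max.col() gives the position of the largest value in each row. It is only
# trusted where exactly one flag is set, so there it names the single genre.
single <- genre_cols[max.col(as.matrix(toy[, genre_cols]), ties.method = "first")]

toy$genre <- factor(ifelse(n_flags > 1, "Mixed",
                    ifelse(n_flags == 0, "None", single)))
print(toy)
```

The four toy rows come out as Action, Drama, Mixed and None, which matches the rule described in the text.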

Now, let’s do some univariate plots. For example, the count of movies in each “genre” we just defined, drawn as horizontal bars.

ggplot(movies)+geom_bar(aes(x=genre),fill="#004C99")+
  labs(title = "Count of Genre", x = "Genre", y = "Count")+coord_flip()+th

As the bar graph shows, Drama has the highest count apart from the Mixed genre, and Animation has the lowest.

Let’s look at the distribution of movie length (super-long movies that run over 5 hours fall outside the limits of the graph).
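By default the bars follow the factor level order, which is alphabetical here. If you want them sorted by count, one option is to rebuild the factor with its levels ordered by frequency. This is only a sketch on a made-up vector (`toy_genre` and `genre_sorted` are hypothetical names).

```r
# Hypothetical genre labels, just to show the reordering step.
toy_genre <- factor(c("Drama", "Comedy", "Drama", "Mixed", "Drama", "Mixed"))

counts <- table(toy_genre)
ordered_levels <- names(sort(counts))  # ascending by count

# Same values, but the levels now run from least to most frequent.
genre_sorted <- factor(toy_genre, levels = ordered_levels)
levels(genre_sorted)
```

Passing a factor like `genre_sorted` to `geom_bar()` together with `coord_flip()` should put the most frequent genre at the top of the chart, since the flipped axis draws the last level highest.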

ggplot(movies, aes(x=length)) +
  geom_bar()+
  coord_cartesian(xlim=c(0,300))+th

From the graph, bimodality is observed for movie length. The length is not normally distributed.

Would the length of movies be normally distributed within each genre?

ggplot(data = movies, aes(x = length)) +
  geom_histogram(bins = 50) + facet_wrap(~genre) + 
  ggtitle("Histogram of length by Genre") +
  xlab("Length") + ylab("Count") + th

From the graph, bimodality is still observed in Documentary and Mixed movies. Besides that, the distribution of movie length for Action, Drama and Short movies is slightly left-skewed; otherwise, movie length by genre is mostly normally distributed.

Let’s look at the frequency plot of budget without the missing values.

ggplot(movies, aes(x=budget_millions)) +
  geom_freqpoly(bins = 50)+th

A right-skewed distribution of movie budget is observed: most of the mass sits at low budgets, with a long tail toward expensive movies. Most movies require less than $50 million of budget.

Similarly, let’s look at budget by genre (y-axis on a logarithmic scale).
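The direction of the skew can also be checked with a number instead of by eye. Below is a minimal sketch of the moment-based sample skewness (the `skewness` helper and `toy_budget` vector are hypothetical, written here for illustration). A positive value means the long tail points toward large values, a negative value means it points toward small values.

```r
# Moment-based sample skewness: third central moment divided by
# the second central moment raised to the power 3/2.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Hypothetical budgets in millions: mostly small, one very large.
toy_budget <- c(1, 1, 2, 2, 3, 3, 4, 50)

skewness(toy_budget)   # positive: long tail toward large budgets
skewness(-toy_budget)  # mirrored data gives the opposite sign
```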

ggplot(data = movies, aes(x = budget_millions)) +
  geom_histogram(bins = 50) + facet_wrap(~genre) + 
  scale_y_log10()+
  ggtitle("Histogram of Budget by Genre") +
  xlab("Budget (millions)") + ylab("Count") + th

## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Removed 236 rows containing missing values (geom_bar).
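The first warning most likely comes from empty histogram bins: a bin with a count of 0 has no finite logarithm, so it cannot be placed on a log axis. A small sketch with made-up bin counts (`bin_counts` is hypothetical) shows the effect.

```r
# Hypothetical bin counts; two of the bins are empty.
bin_counts <- c(120, 15, 0, 3, 0, 1)

log10(bin_counts)                     # the zeros turn into -Inf
sum(is.infinite(log10(bin_counts)))   # how many bins cannot be drawn

# One common workaround is to shift the counts by 1 before taking the log.
log10(bin_counts + 1)
```

The shifted version maps empty bins to 0 instead of negative infinity, at the cost of slightly distorting small counts.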

A strongly right-skewed distribution is still observed across the different genres.

Finally, let’s look at the distribution of rating in each genre.

ggplot(data = movies, aes(x = rating)) +
  geom_histogram(bins = 30) + facet_wrap(~genre) + ggtitle("Histogram of Rating by Genre") + 
  xlab("Rating") + ylab("Count")+th

Drama ratings are concentrated toward the high end of the scale, meaning this genre has more movies with higher scores, while the ratings for Comedy are more evenly distributed.

Alternatively, let’s do a boxplot.

ggplot(aes(x = genre, y = rating), data = movies) + geom_boxplot() + ggtitle("Distribution of Ratings for different Genres") + th

As shown in the boxplot, Documentary and Short movies have higher ratings at the 25th percentile, median, and 75th percentile. Interestingly, Drama movies have notable outliers on the low side, even though most of their ratings sit toward the high end in the histogram.

Now, let’s answer Vinayak’s questions. Has the popularity of some genres of movies decreased or increased over time? Look at the average ratings for each genre.

First, it is fun to do a time-series plot.

ggplot(movies, aes(x=year, y=rating,colour=genre,group=genre)) + stat_summary(fun="mean", geom="smooth")

The plot is too crowded to read easily, but it can still be observed that Romance movies have become increasingly popular in very recent years.

An alternative way to approach this problem is to subset the dataset into decade ranges.

# Bin release years into decades. Note that "50s" covers everything
# before 1960 and "90s" covers 1990 onward.
decade <- rep(NA, nrow(movies))
decade[which(movies$year >= 1990)] <- "90s"
decade[which(movies$year >= 1980 & movies$year < 1990)] <- "80s"
decade[which(movies$year >= 1970 & movies$year < 1980)] <- "70s"
decade[which(movies$year >= 1960 & movies$year < 1970)] <- "60s"
decade[which(movies$year < 1960)] <- "50s"
movies$decade <- as.factor(decade)
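The same decade bins could also be built in a single call with base R’s `cut()`. This is a sketch on a made-up vector of years (`toy_years` and `toy_decade` are hypothetical names); `right = FALSE` makes each interval include its lower bound, so 1960 falls in the “60s”.

```r
# Hypothetical release years covering every bin, including the edges.
toy_years <- c(1935, 1959, 1960, 1975, 1989, 1990, 2005)

toy_decade <- cut(
  toy_years,
  breaks = c(-Inf, 1960, 1970, 1980, 1990, Inf),
  labels = c("50s", "60s", "70s", "80s", "90s"),
  right  = FALSE   # intervals are [lower, upper)
)
data.frame(year = toy_years, decade = toy_decade)
```

As in the code above, the open-ended first and last bins mean that “50s” really stands for “before 1960” and “90s” for “1990 and later”.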

Now redo the average-rating plot, this time by decade.

ggplot(movies, aes(x=decade, y=rating,colour=genre,group=genre)) + stat_summary(fun="mean", geom="smooth")

It is easier to view now. Animation has lost popularity in recent years.

The other Vinayak question is “Find the average length of movies for recent 5 decades. See if there is any pattern with time.”

ggplot(movies, aes(x=decade, y=length,colour=genre,group=genre)) + stat_summary(fun="mean", geom="smooth")

Wow, this is very interesting. Documentary, Short and Animation movies all have their peak movie length in the 70s. The most recent trend is shorter movies.

We noticed before that the length of movies is bimodal. So let’s focus on longer movies of over 50 minutes, to get the normally distributed part of the data.
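To read the averages as numbers instead of off the plot, the means per decade can be tabulated with `aggregate()`. This is a minimal sketch on made-up data (`toy` and `avg_length` are hypothetical names, and the lengths are invented).

```r
# Hypothetical movies with a decade label and a running time in minutes.
toy <- data.frame(
  decade = factor(c("70s", "70s", "80s", "80s", "90s")),
  length = c(100, 120, 90, 110, 95)
)

# Mean length within each decade.
avg_length <- aggregate(length ~ decade, data = toy, FUN = mean)
avg_length
```

Using `length ~ decade + genre` as the formula would give one mean per decade and genre, which is the same summary the plot draws.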

long_movies <- subset(movies,length > 50)
ggplot(data = long_movies, aes(x = length, y = rating)) + geom_point(alpha=0.25) + 
  geom_smooth(method = "lm" )+
  labs(title = "Length and Rating", x = "Length", y = "rating")+th

In general, excluding movies under 50 minutes, longer movies are associated with higher ratings.
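The strength of that association can be put into numbers with a correlation and the slope of the fitted line. Here is a sketch on made-up values (`toy`, `fit` and `slope` are hypothetical names, and the data are invented so the trend is obvious).

```r
# Hypothetical lengths (minutes) and ratings with an upward trend.
toy <- data.frame(
  length = c(60, 90, 100, 120, 150),
  rating = c(5.0, 5.5, 6.0, 6.5, 7.0)
)

# Pearson correlation: the sign gives the direction of the association.
cor(toy$length, toy$rating)

# Slope of the least-squares line: change in rating per extra minute.
fit <- lm(rating ~ length, data = toy)
slope <- coef(fit)[["length"]]
slope
```

A positive slope matches what `geom_smooth(method = "lm")` draws, but it says nothing about how much of the spread in ratings the length actually explains, so the R-squared from `summary(fit)` is worth a look too.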

ggplot(data = movies, aes(x = length, y = rating, col=genre)) + geom_point(alpha=0.5) + 
  geom_smooth(method = "lm" )+facet_wrap(~genre,ncol=3)+coord_cartesian(xlim=c(0,300))+
  labs(title = "Length and Rating", x = "Length", y = "rating")+th

Looking by genre, for Action, Comedy and Romance movies, longer movies are associated with higher ratings.

Then, let’s see if budget has any relationship with rating. Hypothesis 2: Higher budget is associated with higher rating. We noticed earlier that the distribution of budget is strongly right-skewed. So let’s break the analysis into two budget ranges, “low_budget” and “high_budget”. The median is 3 million, so we will take that as the separation point.
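The code that creates the two subsets is not shown in this report. A minimal sketch of how such a split could look, on made-up data, is below. The name `cutoff` and the choice to put movies at exactly 3 million into the low group are my assumptions here, not something stated above.

```r
# Hypothetical movies with a budget in dollars and a rating.
toy <- data.frame(
  budget = c(5e5, 3e6, 1.2e7, 8e7),
  rating = c(6.1, 5.4, 6.8, 7.0)
)

cutoff <- 3e6  # the 3 million separation point from the text

# Assumption: a budget equal to the cutoff counts as low budget.
low_budget  <- subset(toy, budget <= cutoff)
high_budget <- subset(toy, budget >  cutoff)

nrow(low_budget)
nrow(high_budget)
```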

Let’s look at low_budget & rating for different genres,

ggplot(data = low_budget, aes(x = budget, y = rating, col=genre)) + geom_point() + 
  geom_smooth(method = "lm" )+facet_wrap(~genre,ncol=3)+
  labs(title = "Low_budget and Rating", x = "low_budget", y = "rating")+th

For low_budget Action movies, higher budget is associated with lower ratings. For low_budget Animation and Romance movies, higher budget is associated with higher ratings.

ggplot(data = na.omit(high_budget), aes(x = budget, y = rating, col=genre)) + geom_point() + 
  geom_smooth(method = "lm" )+facet_wrap(~genre,ncol=3)+
  labs(title = "High_budget and Rating", x = "high_budget", y = "rating")+th

Interestingly, for high_budget Action movies, higher budget is associated with higher ratings. For Romance movies, higher budget is associated with lower ratings.

Now, try subsetting the data based on the number of votes. I will do that based on the quantiles.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       5      67     612    4974    4642  157600 

# Bin the number of votes at the 1st quartile (67), median (612) and 3rd quartile (4642)
vote_range <- rep(NA, nrow(movies))
vote_range[which(movies$votes <= 67)] <- "a few votes"
vote_range[which(movies$votes <= 612 & movies$votes > 67)] <- "median votes"
vote_range[which(movies$votes <= 4642 & movies$votes > 612)] <- "quite many votes"
vote_range[which(movies$votes > 4642)] <- "very many votes"
movies$vote_range <- as.factor(vote_range)

Hypothesis #3: More votes are associated with higher rating.
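Instead of typing the quartile values by hand, the break points can be taken straight from `quantile()` and passed to `cut()`. This is a sketch on a made-up vector (`toy_votes`, `breaks` and `toy_range` are hypothetical names); the labels are the same ones used above.

```r
# Hypothetical vote counts.
toy_votes <- 1:8

# Minimum, three quartiles and maximum as bin edges.
# cut() fails if two edges are equal, which can happen with heavily tied data.
breaks <- quantile(toy_votes, probs = c(0, 0.25, 0.5, 0.75, 1))

toy_range <- cut(
  toy_votes,
  breaks = breaks,
  include.lowest = TRUE,  # keep the minimum inside the first bin
  labels = c("a few votes", "median votes", "quite many votes", "very many votes")
)
table(toy_range)
```

With this approach the bins follow the data automatically if the set of movies changes, for example after a different filtering step.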

ggplot(movies, aes(x=vote_range, y=rating,colour=genre,group=genre)) + stat_summary(fun="mean", geom="smooth")

I believe this hypothesis holds true for all genres.

require(graphics)
lmvr<-lm(movies$rating ~ movies$vote_range)
summary(lmvr)

## 
## Call:
## lm(formula = movies$rating ~ movies$vote_range)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2229 -0.9370  0.1109  1.0370  3.7771 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        6.22289    0.04126 150.808  < 2e-16 ***
## movies$vote_rangemedian votes     -0.63384    0.05836 -10.862  < 2e-16 ***
## movies$vote_rangequite many votes -0.20881    0.05843  -3.573 0.000356 ***
## movies$vote_rangevery many votes   0.51407    0.05838   8.806  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.491 on 5211 degrees of freedom
## Multiple R-squared:  0.07138,    Adjusted R-squared:  0.07084 
## F-statistic: 133.5 on 3 and 5211 DF,  p-value: < 2.2e-16

plot(lmvr,las=1)
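With a single categorical predictor and R’s default treatment contrasts, the intercept is the mean rating of the baseline group (“a few votes”) and every other coefficient is that group’s difference from the baseline. The sketch below just redoes that arithmetic with the estimates printed in the summary (`intercept`, `offsets` and `group_means` are hypothetical names).

```r
# Estimates copied from the regression summary above.
intercept <- 6.22289
offsets <- c(
  "a few votes"      =  0,        # baseline group
  "median votes"     = -0.63384,
  "quite many votes" = -0.20881,
  "very many votes"  =  0.51407
)

# Fitted mean rating for each vote range.
group_means <- intercept + offsets
round(group_means, 2)
```

The fitted means come out at about 6.22, 5.59, 6.01 and 6.74, so the mean rating dips for the two middle groups before rising for the movies with the most votes.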

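Looking at the p-values, R-squared, standard error and residual distribution, I have more confidence that hypothesis #3 holds: the very small p-values mean the null hypothesis of no association between vote range and rating can be rejected. Note, however, that the R-squared of 0.07 means vote range explains only about 7% of the variance in rating, and only the “very many votes” group rates higher than the “a few votes” baseline.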