1. Example Analysis: Exploring the Movie Dataset

As per usual, we’ll start by loading the relevant packages as well as our data. The file we’ll read in has data on 651 movies. I’m particularly interested in how the critics’ ratings relate to the audience’s ratings.

library(dplyr)
library(ggplot2)
movies <- read.csv('https://raw.githubusercontent.com/idc9/stor390/master/data/movies.csv')

#now take a look at the data
View(movies)

Making Plots

movies %>% ggplot(aes(critics_score, audience_score))+geom_point()

movies %>% ggplot(aes(imdb_num_votes, imdb_rating))+geom_point()

movies %>% ggplot(aes(runtime)) + geom_histogram()

A Best Fit Line to Predict Audience Score from Critics Score

lm(audience_score~critics_score, data=movies)

The equation is

audience_score = \(0.501\cdot\) critics_score \(+ 33.4\)

  1. What audience_score would we predict for a movie with a critics score of 50?

  2. What audience_score would we predict for a movie with a critics score of 80?

  3. What audience_score would we predict for a movie with a critics score of 0?
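One way to check answers to questions like these is to plug a critics score into the fitted equation. The sketch below hard-codes the rounded coefficients from the equation above; the helper name `predict_audience` is made up for illustration. With the fitted model object in hand, `predict()` with a `newdata` data frame would do the same job using the unrounded coefficients.

```r
# Rounded coefficients copied from the fitted equation above
slope <- 0.501
intercept <- 33.4

# Hypothetical helper: predicted audience_score for a given critics_score
predict_audience <- function(critics_score) {
  slope * critics_score + intercept
}

# The function is vectorized, so several scores can be checked at once
predict_audience(c(50, 80, 0))
```

Note that a critics score of 0 simply returns the intercept, which is worth keeping in mind when interpreting the third question.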

Plotting the Best Fit Line

movies %>% ggplot(aes(critics_score, audience_score))+geom_point()+
  geom_smooth(method="lm")

Residuals

Residuals are the differences between the actual values and what our model (meaning our best-fit line in this case) predicted. The code below calculates the residuals and assigns them to a new column in our data set. A positive residual in this case means that the audience liked the movie more than we would expect based on the critics’ score.

m <- lm(audience_score~critics_score, data=movies)
movies$residuals <- residuals(m)
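To see what `residuals()` is doing, it can help to compute the same numbers by hand on a tiny example. The sketch below uses a made-up four-row data frame (`toy`, not taken from the movies file) and checks that "actual minus predicted" matches what `residuals()` returns.

```r
# Toy data (made up for illustration, not from the movies file)
toy <- data.frame(
  critics_score  = c(10, 40, 60, 90),
  audience_score = c(35, 60, 55, 85)
)

toy_fit <- lm(audience_score ~ critics_score, data = toy)

# A residual is the actual value minus the model's prediction
by_hand <- toy$audience_score - fitted(toy_fit)

# residuals() returns the same numbers
all.equal(unname(by_hand), unname(residuals(toy_fit)))

# With an intercept in the model, least-squares residuals sum to
# (numerically) zero
sum(residuals(toy_fit))
```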

First, let’s take a look at the residuals:

View(movies)

We can use box plots to see which types of movies audiences like more (and less) than critics.

movies %>% ggplot(aes(genre, residuals))+geom_boxplot()+
  theme(axis.text.x = element_text(angle = 90))

  1. Which types of movies do audiences like more/less than critics?

Another way to look at this is to make a different best fit line for each movie genre. We can do that graphically as follows:

movies %>% ggplot(aes(critics_score, audience_score, group=genre, color=genre))+
  geom_point(size=0.5)+
  geom_smooth(method="lm", se=FALSE)

  1. What are advantages and disadvantages of making a distinct best-fit line for each genre rather than one overall best-fit line?
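The plot draws one line per genre but does not report the slopes and intercepts behind those lines. A minimal sketch of one way to get the numbers is to split the data by group and fit a separate `lm()` to each piece. The data frame `toy_genres` below is made up for illustration (two genres, three movies each); with the real data, the same pattern would apply to `movies` and its `genre` column.

```r
# Toy data (made up for illustration): two genres with different trends
toy_genres <- data.frame(
  genre          = rep(c("Comedy", "Drama"), each = 3),
  critics_score  = c(20, 50, 80, 20, 50, 80),
  audience_score = c(40, 55, 70, 30, 60, 90)
)

# Split the rows by genre and fit one line per group
fits <- lapply(split(toy_genres, toy_genres$genre), function(d) {
  coef(lm(audience_score ~ critics_score, data = d))
})

# One row per genre: the intercept and slope of that genre's line
do.call(rbind, fits)
```

Each per-genre fit uses only that genre's rows, so genres with few movies get lines based on very little data. That trade-off is one thing to weigh when answering the question above.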