As per usual, we’ll start by loading the relevant packages as well as our data. The file we’ll read in has data on 651 movies. I’m particularly interested in how the critics’ ratings relate to the audience’s ratings.
library(dplyr)
library(ggplot2)
movies <- read.csv('https://raw.githubusercontent.com/idc9/stor390/master/data/movies.csv')
#now take a look at the data
View(movies)
movies %>% ggplot(aes(critics_score, audience_score))+geom_point()
movies %>% ggplot(aes(imdb_num_votes, imdb_rating))+geom_point()
movies %>% ggplot(aes(runtime)) + geom_histogram()
lm(audience_score~critics_score, data=movies)
Rounding the coefficients that lm() reports, the equation of the best-fit line is
audience_score = \(0.501\cdot\) critics_score \(+ 33.4\)
What audience_score would we predict for a movie with a critics score of 50?
What audience_score would we predict for a movie with a critics score of 80?
What audience_score would we predict for a movie with a critics score of 0?
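One way to check your answers to the questions above is to plug each critics score into the equation. The sketch below uses the rounded coefficients quoted above, so its results may differ slightly from what the fitted model itself would give; the function name predict_audience is just a hypothetical helper for illustration.

```r
# Rounded coefficients copied from the equation above (not the exact fit)
slope <- 0.501
intercept <- 33.4

# Hypothetical helper: predicted audience_score for a given critics_score
predict_audience <- function(critics_score) {
  slope * critics_score + intercept
}

# Predictions for critics scores of 50, 80, and 0
predictions <- predict_audience(c(50, 80, 0))
predictions
```

Note that a critics score of 0 simply returns the intercept. With a fitted model object you could instead call predict(), passing newdata = data.frame(critics_score = c(50, 80, 0)), which uses the unrounded coefficients.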
movies %>% ggplot(aes(critics_score, audience_score))+geom_point()+
geom_smooth(method="lm")
Residuals are the differences between the actual values and what our model (meaning our best-fit line in this case) predicted. The code below calculates the residuals and assigns them to a new column in our data set. A positive residual in this case means that the audience liked the movie more than we would expect based on the critics’ score.
m <- lm(audience_score~critics_score, data=movies)
movies$residuals <- residuals(m)
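To make the definition concrete, here is a minimal sketch on a tiny made-up data set (the four rows in toy are invented for illustration and are not from the movies file). It computes "actual minus predicted" by hand and confirms that this matches what residuals() returns.

```r
# Made-up toy data, for illustration only
toy <- data.frame(
  critics_score  = c(10, 40, 60, 90),
  audience_score = c(35, 60, 55, 85)
)

toy_model <- lm(audience_score ~ critics_score, data = toy)

# Residual = actual value minus the value predicted by the line
manual_residuals <- toy$audience_score - fitted(toy_model)

# Should be TRUE: the manual calculation agrees with residuals()
all.equal(unname(manual_residuals), unname(residuals(toy_model)))
```

For a least-squares line with an intercept, the residuals also sum to zero (up to rounding error), so some points must sit above the line and some below.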
First, let’s take a look at the residuals:
View(movies)
We can use box plots to see which types of movies audiences like more (and less) than critics do.
movies %>% ggplot(aes(genre, residuals))+geom_boxplot()+ theme(axis.text.x = element_text(angle = 90))
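The thick line in each box is the median residual for that genre, so a numeric version of the same comparison is a table of medians. The sketch below uses base R's tapply() on a small made-up data frame (the genres and residual values in toy are invented, not taken from the movies data); the same call should work on movies$residuals and movies$genre once the residuals column exists.

```r
# Made-up residuals for two genres, for illustration only
toy <- data.frame(
  genre     = c("Comedy", "Comedy", "Comedy", "Drama", "Drama", "Drama"),
  residuals = c(5, 10, -2, -8, -1, -4)
)

# Median residual within each genre
median_by_genre <- tapply(toy$residuals, toy$genre, median)

# Genres the audience liked most, relative to the critics, come first
sort(median_by_genre, decreasing = TRUE)
```

A positive median means the audience tended to like that genre more than the critics’ scores would suggest; a negative median means the opposite.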
Another way to look at this is to fit a separate best-fit line for each movie genre. We can do that graphically as follows:
movies %>% ggplot(aes(critics_score, audience_score, group=genre, color=genre))+geom_point(size=0.5)+geom_smooth(aes(color=genre), method="lm", se=FALSE)
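If you want the numbers behind those lines, one possible approach is to split the data by genre and fit a separate lm() to each piece. The sketch below does this in base R on made-up data (the toy data frame is invented so that each genre lies exactly on a known line); it is one way to get per-genre slopes and intercepts, not necessarily how geom_smooth() computes them internally.

```r
# Made-up data: Comedy follows 0.5 * critics + 40, Drama follows 1 * critics + 5
toy <- data.frame(
  genre          = rep(c("Comedy", "Drama"), each = 3),
  critics_score  = c(20, 50, 80, 20, 50, 80),
  audience_score = c(50, 65, 80, 25, 55, 85)
)

# Fit one regression per genre and keep only the coefficients
fits <- lapply(split(toy, toy$genre), function(d) {
  coef(lm(audience_score ~ critics_score, data = d))
})

# One row per genre: intercept and slope
coefs <- do.call(rbind, fits)
coefs
```

A steeper slope means the audience score in that genre tracks the critics’ score more closely; a higher intercept means the audience is more generous to that genre’s poorly reviewed movies.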