First we bring in all the libraries we will be using. Then we load the data set we have downloaded.
#Load in Libraries
library(tidyr)
library(readr)
library(dplyr)
library(forcats)
library(lubridate)
library(stringr)
library(janitor)
library(ggplot2)
library(scales)
library(pwrss)
library(tidyverse)
library(ggthemes)
library(ggrepel)
library(effsize)
library(broom)
library(boot)
library(lindia)
#Load in the dataset
movies_raw <- read_csv("/Users/jus10segrest/Downloads/iu indy/stat for data science/movies.csv")
#remove all na's
movies_raw <- movies_raw |>
drop_na(score)
movies_raw <- movies_raw |>
drop_na(runtime)
movies_raw <- movies_raw |>
drop_na(budget)
movies_raw <- movies_raw |>
drop_na(votes)
The next step for our data set is to clean it and format it so that we can begin to work through it.
#create a new table separating the released column into two release date/country
movies_ <- movies_raw |>
separate(released, into = c("release_new","country_released"), sep=" \\(") |>
mutate(country_released = str_remove(country_released, "\\)$")) |> #remove the end parathensis
mutate(release_date=mdy(release_new)) |> #then change the date to an easier format
rename(country_filmed=country) #rename column for ease of understanding
movies_
## # A tibble: 5,492 × 17
## name rating genre year release_new country_released score votes director
## <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <chr>
## 1 The Sh… R Drama 1980 June 13, 1… United States 8.4 9.27e5 Stanley…
## 2 The Bl… R Adve… 1980 July 2, 19… United States 5.8 6.5 e4 Randal …
## 3 Star W… PG Acti… 1980 June 20, 1… United States 8.7 1.20e6 Irvin K…
## 4 Airpla… PG Come… 1980 July 2, 19… United States 7.7 2.21e5 Jim Abr…
## 5 Caddys… R Come… 1980 July 25, 1… United States 7.3 1.08e5 Harold …
## 6 Friday… R Horr… 1980 May 9, 1980 United States 6.4 1.23e5 Sean S.…
## 7 The Bl… R Acti… 1980 June 20, 1… United States 7.9 1.88e5 John La…
## 8 Raging… R Biog… 1980 December 1… United States 8.2 3.30e5 Martin …
## 9 Superm… PG Acti… 1980 June 19, 1… United States 6.8 1.01e5 Richard…
## 10 The Lo… R Biog… 1980 May 16, 19… United States 7 1 e4 Walter …
## # ℹ 5,482 more rows
## # ℹ 8 more variables: writer <chr>, star <chr>, country_filmed <chr>,
## # budget <dbl>, gross <dbl>, company <chr>, runtime <dbl>,
## # release_date <date>
Last week I did a linear regression model based on runtime and score. Below are the scatterplot and model outputs of that model
movies_ |>
ggplot(mapping = aes(x = runtime, y = score)) +
geom_point(size = 2) +
geom_smooth(method = "lm", se = FALSE, color = 'darkblue')
model <- lm(score ~ runtime, movies_)
model$coefficients
## (Intercept) runtime
## 3.98511815 0.02222211
I am going to look into adding the variables budget and votes into my model. They are both continuous variables and make sense when looking into what can affect the score of a movie. When looking at the scores of movies, we would instinctively assume that the budget would heavily affect it. More money equals better actors, sets, time to work on it, and many other things. So I would expect to see a positive correlation between budget and score. In terms of votes, all of our scores in the data set are from IMDB, so the vote variable is how many people have contributed a score for that specific movie. I would guess that the higher the amount of votes the higher the score. People tend to go out of their way to vote for things they like (or really hate), so it will be interesting to see which direction this goes.
Lets take a look at Multicollinearity now.
#runtime and budget
movies_ |>
ggplot(mapping = aes(x = runtime, y = budget)) +
geom_point(size = 2)
#budget and votes
movies_ |>
ggplot(mapping = aes(x = budget, y = votes)) +
geom_point(size = 2)
#runtime and votes
movies_ |>
ggplot(mapping = aes(x = runtime, y = votes)) +
geom_point(size = 2)
From these graphs we can see that their may be some relation between the variables but nothing that should seriously affect the model.
model <- lm(score ~ runtime + budget + votes, movies_)
model$coefficients
## (Intercept) runtime budget votes
## 4.405892e+00 1.756831e-02 -5.429252e-09 2.420189e-06
tidy(model, conf.int = TRUE)
## # A tibble: 4 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 4.41e+0 6.67e- 2 66.1 0 4.28e+0 4.54e+0
## 2 runtime 1.76e-2 6.39e- 4 27.5 6.04e-156 1.63e-2 1.88e-2
## 3 budget -5.43e-9 2.92e-10 -18.6 8.97e- 75 -6.00e-9 -4.86e-9
## 4 votes 2.42e-6 6.68e- 8 36.2 7.05e-258 2.29e-6 2.55e-6
When looking at the model before doing our diagnostic plots we can see that runtime still has the strongest effect over budget and votes. The average score of a movie when all of these variables are at 0 is expected to be 4.41, and for every unit increase of runtime the score will increase by 0.02. But for budget and votes it is around a 0 increase or decrease. It is interesting to see that budget has a negative effect on the score.
We can also see using the tidy function, that all of them are very significant though to the model.
gg_resfitted(model) +
geom_smooth(se=FALSE)
From this plot we can obviously see that their is not a random amount of variance in the model. This shows non-linearity as well as heteroscedasticity, this means in the future we may need to improve our model or change it to hopefully fix these issues
plots <- gg_resX(model, plot.all = FALSE)
# for each variable of interest ...
plots$runtime +
geom_smooth(se = FALSE)
plots$budget +
geom_smooth(se = FALSE)
plots$votes +
geom_smooth(se = FALSE)
I included all variables in the graphs as I was curious of what we would see. From this we can see that runtime and votes are bigger issues of our fitted values than budget seems to be. Votes seems to indicate that higher values have higher residuals and we can see a smaller version of this in runtime. But in all of these we can see that budget is the closest to a linear relationship with score, with runtime and votes having a more non-linear relationship.
ggplot(data = movies_) +
geom_point(mapping = aes(x=votes, y=score))
From this scatter plot we can see what I mentioned early when discussing the new variables. People tend to only vote in mass for movies that are highly rated which is very evident in this plot.
gg_reshist(model)
When looking at this residual histogram we want it to be normally distributed with no skewness. We can see that we actually have a very normal distribution towards the middle with 0 being our middle, but we do have a skewness to the left of the distribution. This indicates that we are missing something still (evident from our earlier tests) and that we can tweak our model more.
gg_qqplot(model)
We can see from this QQ-Plot the same issues we have been having in all of our other diagnostic tests. Relative linearity except for the left tail which is skewed. When we are fixing our model in later lessons, we need to make sure that this skewness is addressed.
gg_cooksd(model, threshold = 'matlab')
We can see that we actually have hundreds of points that are irregular and have a high influence on the model. Using a scatter plot and choosing just some of the rows that are having a significant affect on the model we get this plot below.
ggplot(data = slice(movies_,
c(1436, 1437, 3621, 3964))) +
geom_point(data = movies_,
mapping = aes(x = votes, y = score)) +
geom_point(mapping = aes(x = votes, y = score),
color = 'darkred') +
geom_text_repel(mapping = aes(x = votes,
y = score,
label = name),
color = 'darkred') +
labs(title="Investigating High Influence Points",
subtitle="Label = Movie Name")
We can see what we have seen for most of the diagnostics. The most voted movies are all consistently highly rated which is affecting the score significantly. There are too many to tag these as outliers so in the future it will be interesting to see what the best course of action for these points are.