First we bring in all the libraries we will be using. Then we load the data set we have downloaded.
#Load in Libraries
library(tidyr)
library(readr)
library(dplyr)
library(forcats)
library(lubridate)
library(stringr)
library(janitor)
library(ggplot2)
library(scales)
library(pwrss)
library(tidyverse)
library(ggthemes)
library(ggrepel)
library(effsize)
library(broom)
library(boot)
library(lindia)
library(xts)
library(tsibble)
#Load in the dataset
movies_raw <- read_csv("/Users/jus10segrest/Downloads/iu indy/stat for data science/movies.csv")
#remove all na's
movies_raw <- movies_raw |>
drop_na(name)
movies_raw <- movies_raw |>
drop_na(score)
movies_raw <- movies_raw |>
drop_na(released)
The next step for our data set is to clean it and format it so that we can begin to work through it.
#create a new table separating the released column into two release date/country
movies_ <- movies_raw |>
separate(released, into = c("release_new","country_released"), sep=" \\(") |>
mutate(country_released = str_remove(country_released, "\\)$")) |> #remove the end parathensis
mutate(release_date=mdy(release_new)) |> #then change the date to an easier format
rename(country_filmed=country) #rename column for ease of understanding
movies_ <- movies_ |>
drop_na(release_date)
movies_
## # A tibble: 7,653 × 17
## name rating genre year release_new country_released score votes director
## <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <chr>
## 1 The Sh… R Drama 1980 June 13, 1… United States 8.4 9.27e5 Stanley…
## 2 The Bl… R Adve… 1980 July 2, 19… United States 5.8 6.5 e4 Randal …
## 3 Star W… PG Acti… 1980 June 20, 1… United States 8.7 1.20e6 Irvin K…
## 4 Airpla… PG Come… 1980 July 2, 19… United States 7.7 2.21e5 Jim Abr…
## 5 Caddys… R Come… 1980 July 25, 1… United States 7.3 1.08e5 Harold …
## 6 Friday… R Horr… 1980 May 9, 1980 United States 6.4 1.23e5 Sean S.…
## 7 The Bl… R Acti… 1980 June 20, 1… United States 7.9 1.88e5 John La…
## 8 Raging… R Biog… 1980 December 1… United States 8.2 3.30e5 Martin …
## 9 Superm… PG Acti… 1980 June 19, 1… United States 6.8 1.01e5 Richard…
## 10 The Lo… R Biog… 1980 May 16, 19… United States 7 1 e4 Walter …
## # ℹ 7,643 more rows
## # ℹ 8 more variables: writer <chr>, star <chr>, country_filmed <chr>,
## # budget <dbl>, gross <dbl>, company <chr>, runtime <dbl>,
## # release_date <date>
For my time column I actually have already done this mutation in the past. In the original movies.csv, their is a column listed as released and the format is ex. June 13, 1980(United States). Now this isn’t a column that can be used without splitting it and mutating it from there. So what I did was split the column into release_new (ex. June 13, 1980) and country_released (ex. United States). I then changed the date format into a new column called release date which is formatted YYYY-MM-DD. This makes it much easier to use in the data and easier to understand as well. All of this can be seen above in my initial code.
The column I will choose to analyze over time is score. I decided to go with this column because it will be interesting to see if it has been stagnant over time or if there has been a decline/increase in the average moving rating. I decided against doing gross revenue or budget as I wouldn’t be able to account for inflation so I would assume that movies will just have a natural increase in gross and budget over time, making it not that interesting.
First we need to select just our data that we need, this being name and release_date. I figured name was the best choice for this as it would be a way to have distinct rows for later at all times. I also will use the name of the movie as my key. Since multiple movies can be released on the same day, and there is no way to get a deeper level of specific time, using the name column will allow me to differentiate them.
#filter only pertinent time series data, and remove duplicates
movies_time <- movies_ |>
select(name, release_date, score) |>
distinct()
# look at the data
#View(movies_time)
From the table we can see that we still contain all of the data even when making sure all rows are distinct. Now we need to create a tsibble table to clean the data and leave no gaps for days.
# create a tsibble
movies_ts <- as_tsibble(movies_time, key=name, index=release_date) |>
index_by(date = date(release_date)) |>
summarise(avg_score = mean(score, na.rm = TRUE)) |>
fill_gaps(avg_score = NA) # in my case, a missing row is NA as no movies were released and I don't want a 0 to skew my data
Above is my tsibble table of the data. Since I want to see how scores evolve over time I had to change the code to match what my data needed. This means that I have NA’s instead of 0’s as 0’s would skew my data lower.
# an "xts" object separate from the original
movies_xts <- xts(x = movies_ts$avg_score,
order.by = movies_ts$date)
movies_xts <- setNames(movies_xts, "movie_scores")
Above is the xts table I have made for the movie data set.
movies_ts |>
ggplot() +
geom_line(mapping = aes(x=date, y=avg_score)) +
theme_hc()
Using just a basic plot we can see that there are definitely some interesting trends we can investigate. It seems to contain some periods of low and high movies so I am interested to see what these period are and what the gaps might represent.
movies_xts |>
rollapply(width = 30, \(x) mean(x, na.rm = TRUE), fill = FALSE) |>
ggplot(mapping = aes(x = Index, y = movie_scores)) +
geom_line() +
labs(title = "Movie Scores Over Time",
subtitle = "Monthly Rolling Average") +
theme_hc()
This graph is pretty interesting because it shows that there is constant variance over time, but not much of a steady increase or decline over time, besides a big peak in ~2019 which could be interesting to investigate.
movies_ts |>
filter_index("2015" ~ "2015") |>
drop_na() |>
ggplot(mapping = aes(x = date, y = avg_score)) +
geom_point(size=1, shape='O') +
geom_smooth(span=0.2, color = 'blue', se=FALSE) +
labs(title = "Movie Scores During 2015-2016") +
theme_hc()
movies_ts |>
filter_index("2000" ~ "2005") |>
drop_na() |>
ggplot(mapping = aes(x = date, y = avg_score)) +
geom_point(size=1, shape='O') +
geom_smooth(span=0.2, color = 'blue', se=FALSE) +
labs(title = "Movie Scores During 2000-2004") +
theme_hc()
movies_ts |>
filter_index("2005" ~ "2015") |>
drop_na() |>
ggplot(mapping = aes(x = date, y = avg_score)) +
geom_point(size=1, shape='O') +
geom_smooth(span=0.2, color = 'blue', se=FALSE) +
labs(title = "Movie Scores During 2005-2014") +
theme_hc()
These plots turned out really cool and explain a lot about movie scores over time. From the first graph we can see that in a one year period there are many ups and downs and a huge rise at the end (more on that below). In the 5 year period we see a similar chart but it seems to have smoothed out a bit, with the ups and down becoming less. In the 10 year data we really see this, with a gradual rise throughout but not as many ups and downs as we saw in the previous graphs. This shows that over time movie scores tend to smooth out and we don’t have specific years with much better movies than others.
An interesting tidbit about why in a one year period it seems to rise and drop at different times; in the summer where we can see a noticeable increase, this is because production companies tend to release blockbusters in the summer, as people (usually kids), have more time to go out and see movies. This is also related to the end of the year as many Oscar related movies are released around then so they can use that momentum and get better press right before the Oscars. Using this data we would be able to show a movie production company to release their movie at a specific time if they want a higher score.
#create a trend plot
movies_ts |>
filter_index("2015-01" ~ "2019-12") |>
ggplot(mapping = aes(x = date, y = avg_score)) +
geom_line() +
geom_smooth(method = 'lm', color = 'blue', se=FALSE) +
labs(title = "Movie Scores in 2019") +
theme_hc()
#create a season plot
movies_ts |>
index_by(year = floor_date(date, '3 year')) |>
ggplot(mapping = aes(x = year, y = avg_score)) +
geom_line() +
geom_smooth(span = 0.3, color = 'blue', se=FALSE, ) +
labs(title = "Average Movie Score Over Time",
subtitle = "(by half year)") +
scale_x_date(breaks = "3 year", labels = \(x) year(x)) +
theme_hc()
These graphs are very weird to me. They both come out pretty messed up no matter what combination I use of dates. I think this might be due to the nature of my data having a constant variance over time that is pretty linear when spread out over a long time as shown earlier.
From the linear regression lines in these plots though, we can see a slight positive trend over time. I think this makes sense if you think about it with context of the data. The score data in this data set is taken from IMDB, which means people need to rate movies on that sight in order for them to be in this data set. Most people who would use that website would be younger people, who then will most likely watch newer movies and rate them higher compared to if they watched older movies which they might generally rate lower. The big argument against this though is that people only generally go back and watch good movies from previous eras. People don’t usually go back and watch the bad movies from 1981, but in the current day more people would be likely to see a bad movie as they don’t know if its bad or not when they see it in the present. Maybe that is why we can’t see a positive or negative relationship at all when you take all this information into account.
#ACF
movies_ts |>
mutate(score_lag7 = lag(avg_score, 14)) |>
drop_na()
## # A tsibble: 1,998 x 3 [1D]
## date avg_score score_lag7
## <date> <dbl> <dbl>
## 1 1980-02-01 6.25 5.15
## 2 1980-02-15 6.5 6.25
## 3 1980-03-21 6.4 7.5
## 4 1980-05-09 5.75 6.6
## 5 1980-05-23 4.3 5.75
## 6 1980-05-30 6.8 6.8
## 7 1980-06-06 6.4 4.3
## 8 1980-06-13 5.9 6.8
## 9 1980-06-20 7 6.4
## 10 1980-06-25 5.4 6.1
## # ℹ 1,988 more rows
acf(movies_ts, ci = 0.95, na.action = na.exclude)
#PACF
movies_xts <- xts(movies_ts$avg_score,
order.by = movies_ts$date,
frequency = 14)
movies_xts <- setNames(movies_xts, "scores")
pacf(movies_xts, na.action = na.exclude,
xlab = "Lag", main = "PACF for Movie Scores")
From these graphs we can see a very weird distribution for ACF and but an interesting one for PACF. There seems to be a ton of variation at various lags on the PACF. I wonder if this is due to how much variance there is per year with the movie scores. That could explain the spikes and declines at various lags.