Introduction

The topic of my analysis is movies that came out in the last 30 years. I got this data from Kaggle, and the data was made by scraping IMDb. I chose this topic because movies are an integral part of our society and of how it chooses to spend its free time. Furthermore, I think there is a lot to be gained from researching patterns in how movies are made and how the general public reacts to them over a period of time. In this analysis, the variables I will use are: budget (the amount of money that was planned to be spent making the movie); gross (the amount of money made by the movie after subtracting the amount of money spent making the movie); year (the year that the movie was released in); country (the country where the company that made the movie is located); genre (the most prominent genre that the movie is labeled as).

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.6     v dplyr   1.0.3
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(streamgraph)
library(plotly)
## Warning: package 'plotly' was built under R version 4.0.4
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
setwd("C:/Users/noahz/Desktop/Data 110 R/Datasets/Hate crime data sets")
movies <- read_csv("movies.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   budget = col_double(),
##   company = col_character(),
##   country = col_character(),
##   director = col_character(),
##   genre = col_character(),
##   gross = col_double(),
##   name = col_character(),
##   rating = col_character(),
##   released = col_character(),
##   runtime = col_double(),
##   score = col_double(),
##   star = col_character(),
##   votes = col_double(),
##   writer = col_character(),
##   year = col_double()
## )
#view(movies)

This reduces the release date to just the 4-digit year, because it was formatted as YYYY-MM-DD, which I thought could be hard for the computer to work with.

#gsub("-..*", "", movies$released)

I looked at the dataset again and noticed that year is already a variable so cleaning up the release date is completely unnecessary. I will leave it in to show my process.
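
For the record, here is a minimal sketch of how the year could be pulled out of a date string, assuming `released` is stored as text in the YYYY-MM-DD layout. The dates below are made-up toy values, not rows from movies.csv, and `release_year` is just an illustrative name.

```r
# Toy release dates in YYYY-MM-DD form (hypothetical values, not from movies.csv)
released <- c("1986-08-22", "1990-06-11", "2016-12-25")

# Keep the first four characters and convert them to an integer year
release_year <- as.integer(substr(released, 1, 4))

# Equivalent regex approach: drop everything from the first hyphen onward
release_year_regex <- as.integer(sub("-.*", "", released))

release_year
```

Both approaches assume every date starts with a 4-digit year; a date in a different layout would need to be handled separately.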

I decided to start experimenting with graphs and chose to make a streamgraph, since I had not made one before. I started having problems with the graph not showing up or being completely broken, so I created a variable called “movies2” to reshape the genres in hopes of finding the problem. I ended up not needing it, so I just commented it out.

#movies2 <- movies %>%
  #pivot_wider(names_from = "genre", values_from = "score")
#movies2
#colMeans(movies2[14:30], na.rm = TRUE)

After looking at past lecture notes, I realized that my problem was that my dataset was not formatted for a streamgraph, and after a little testing I was able to make a dataset that was formatted correctly: one row per genre and year, with a count of movies.

movies3 <- movies %>%
  count(genre, year)
movies3
## # A tibble: 342 x 3
##    genre   year     n
##    <chr>  <dbl> <int>
##  1 Action  1986    49
##  2 Action  1987    47
##  3 Action  1988    35
##  4 Action  1989    46
##  5 Action  1990    46
##  6 Action  1991    42
##  7 Action  1992    37
##  8 Action  1993    34
##  9 Action  1994    33
## 10 Action  1995    42
## # ... with 332 more rows

I made a simple streamgraph to see when different genres of movie may have gained or lost popularity over the last 30 years.

streamgraph(movies3, key = "genre", value = "n", date = "year")
## Warning in widget_html(name = class(x)[1], package = attr(x, "package"), :
## streamgraph_html returned an object of class `list` instead of a `shiny.tag`.

That graph seemed weird, as it kept a constant overall height across the 30 years, which would imply that the same number of movies came out every year. So I counted the number of movies that came out each year, and sure enough, almost every year had exactly 220 movies. This may have truly happened, but it could also be that the person collecting the data only collected about 220 movies from each year, which could bias which movies' data was collected.

movies_test <- movies %>%
  count(year)
movies_test
## # A tibble: 31 x 2
##     year     n
##  * <dbl> <int>
##  1  1986   220
##  2  1987   219
##  3  1988   220
##  4  1989   221
##  5  1990   220
##  6  1991   220
##  7  1992   220
##  8  1993   220
##  9  1994   220
## 10  1995   220
## # ... with 21 more rows
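
A quick way to check how uniform the yearly counts are is to look at their range. This is only a sketch on toy data: the four counts below are copied from the first rows of the output above, and the `toy` data frame is a stand-in for the full dataset.

```r
# Toy stand-in: one row per movie, using the counts shown above for 1986-1989
toy <- data.frame(
  year = c(rep(1986, 220), rep(1987, 219), rep(1988, 220), rep(1989, 221))
)

# Count movies per year and drop the table structure to get a plain integer vector
counts <- as.vector(table(toy$year))

# Smallest and largest yearly count
range(counts)

# Spread between them; a spread of only a few movies suggests a fixed sample per year
diff(range(counts))
```

On the real data, the same check could be run on `movies_test$n`.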

Next, I decided to group the movies by country and year and take the average gross, so that I could better compare the productions of different countries over time.

movies_avg <- movies %>%
  group_by(country, year) %>%
  summarise(gross = mean(gross))
## `summarise()` has grouped output by 'country'. You can override using the `.groups` argument.
  #pivot_wider(names_from = "genre", values_from = "gross")
movies_avg
## # A tibble: 572 x 3
## # Groups:   country [57]
##    country    year     gross
##    <chr>     <dbl>     <dbl>
##  1 Argentina  1986   725000 
##  2 Argentina  1990    52148 
##  3 Argentina  1992   100986 
##  4 Argentina  1994    19710 
##  5 Argentina  2000  1221261 
##  6 Argentina  2001   363684 
##  7 Argentina  2004 16756372 
##  8 Argentina  2005  1430372.
##  9 Argentina  2007    46011 
## 10 Argentina  2009 20167424 
## # ... with 562 more rows

Then I made another basic streamgraph that compares the average gross profit per movie for different countries over time. I noticed a few interesting things: in one year, movies from New Zealand companies averaged over 300 million, whereas movies from the US, which I think of as producing the most movies, peaked at an average of only a little over 60 million. The US average, however, is very consistent from year to year.

streamgraph(movies_avg, key = "country", value = "gross", date = "year")
## Warning in widget_html(name = class(x)[1], package = attr(x, "package"), :
## streamgraph_html returned an object of class `list` instead of a `shiny.tag`.
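
Because `summarise(gross = mean(gross))` gives an average per movie, a country with only one or two movies in a year can show a very high value. The sketch below illustrates this on made-up numbers; the countries "A" and "B", the values, and the column names `n_movies`, `mean_gross`, and `total_gross` are all hypothetical.

```r
library(dplyr)

# Toy data (hypothetical): country A has three modest movies, country B one big one
toy <- tibble(
  country = c("A", "A", "A", "B"),
  year    = c(2000, 2000, 2000, 2000),
  gross   = c(10, 20, 30, 90)
)

# Report the number of movies, the mean, and the total side by side
toy_summary <- toy %>%
  group_by(country, year) %>%
  summarise(
    n_movies    = n(),
    mean_gross  = mean(gross),
    total_gross = sum(gross),
    .groups     = "drop"   # return an ungrouped tibble and silence the grouping message
  )

toy_summary
```

Here country B has the higher mean even though it released a single movie, so checking `n_movies` alongside the mean may help explain peaks like New Zealand's.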

While interesting, I found the previous graph too cluttered, so I pared it down to only 5 countries, some of which were consistent and some of which had more drastic fluctuations. I also converted gross to millions of dollars to make the values easier to read.

movies_avg2 <- movies %>%
  group_by(country, year) %>%
  summarise(gross = mean(gross))
## `summarise()` has grouped output by 'country'. You can override using the `.groups` argument.
movies_avg2$gross <- movies_avg2$gross/1000000

movies_avg2
## # A tibble: 572 x 3
## # Groups:   country [57]
##    country    year   gross
##    <chr>     <dbl>   <dbl>
##  1 Argentina  1986  0.725 
##  2 Argentina  1990  0.0521
##  3 Argentina  1992  0.101 
##  4 Argentina  1994  0.0197
##  5 Argentina  2000  1.22  
##  6 Argentina  2001  0.364 
##  7 Argentina  2004 16.8   
##  8 Argentina  2005  1.43  
##  9 Argentina  2007  0.0460
## 10 Argentina  2009 20.2   
## # ... with 562 more rows
movie_test <- movies_avg2 %>%
  filter(country %in% c("USA", "UK", "New Zealand", "South Africa", "Canada"))

movie_test
## # A tibble: 115 x 3
## # Groups:   country [5]
##    country  year  gross
##    <chr>   <dbl>  <dbl>
##  1 Canada   1986  3.01 
##  2 Canada   1987 10.9  
##  3 Canada   1988 10.4  
##  4 Canada   1989  1.29 
##  5 Canada   1990  6.04 
##  6 Canada   1991  2.86 
##  7 Canada   1992  0.610
##  8 Canada   1993 15.5  
##  9 Canada   1994 11.4  
## 10 Canada   1995  6.57 
## # ... with 105 more rows

Then I plotted the newly pared-down data as a dotted line graph to get a better look at how the fluctuating countries compared to the more stable ones. I noticed that New Zealand regularly peaks above the other countries and that the USA is remarkably consistent, even when compared to the UK, which I thought looked fairly consistent on the previous graph.

ggplot(data = movie_test, aes(x = year, y = gross)) +
  labs(title = "gross profit from movies for 5 countries") +
  xlab("Year") +
  ylab("Gross Profit (millions of dollars)") +
  theme_dark(base_size = 13) +
  geom_line(aes(color = country)) +
  geom_point(aes(color = country)) +
  scale_color_brewer(palette = "Set2")

I was not sure where to go with that graph, so I decided to explore something new. This graph checks whether there is any correlation between gross profit and budget. First, I filtered the data down to only movies from 2011 and later to reduce the clutter on the graph, and filtered out anything that grossed 400 million or more to remove some of the outliers and better focus on where the bulk of the data is.

filtered <- movies %>%
  filter(year >= 2011 & gross < 400000000)

filtered$budget <- filtered$budget/1000000
filtered$gross <- filtered$gross/1000000

graph <- ggplot(filtered, aes(x = budget, y = gross)) +
  geom_point()
graph2 <- graph + geom_smooth(method = "lm", formula = y ~ x)

ggplotly(graph2)

As can be seen from the linear regression line and its confidence interval, there is a correlation between the gross profit of a movie and its budget. However, when I looked at just the scatter plot without the regression line, I thought that the correlation was stronger.
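
One way to put a number on how strong the relationship is, rather than judging it by eye, is to compute the correlation coefficient and the R-squared of the fitted line. The sketch below uses made-up budget and gross values in millions of dollars, not rows from movies.csv, and the names `toy`, `r`, `fit`, and `r_squared` are only illustrative.

```r
# Toy budget and gross values in millions of dollars (hypothetical, not from movies.csv)
toy <- data.frame(
  budget = c(5, 20, 40, 60, 100, 150),
  gross  = c(8, 35, 30, 90, 120, 260)
)

# Pearson correlation coefficient: values near 1 mean a strong positive linear relationship
r <- cor(toy$budget, toy$gross)

# Fit the same kind of straight line that geom_smooth(method = "lm") draws
fit <- lm(gross ~ budget, data = toy)

# Share of the variation in gross explained by budget
r_squared <- summary(fit)$r.squared

r
r_squared
```

For a straight-line fit with one predictor, R-squared equals the square of the correlation coefficient. On the real data, the same calls could be run on `filtered$budget` and `filtered$gross`, adding `use = "complete.obs"` to `cor()` if any values are missing.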