The topic of my analysis is movies that came out in the last 30 years. I got this data from Kaggle, and the data was made by scraping IMDB. I chose this topic because movies are an integral part of our society and of how it chooses to spend its free time. Furthermore, I think there is a lot to be gained from researching patterns in how movies are made and how the general public reacts to movies when looked at over a period of time. In this analysis the variables I will use are: budget (the amount of money that was planned to be spent making the movie); gross (the amount of money made by the movie after subtracting the amount of money spent making the movie); year (the year that the movie was released in); country (the country where the company that made the movie is located); genre (the most prominent genre that the movie is labeled as).
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.0.6 v dplyr 1.0.3
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(streamgraph)
library(plotly)
## Warning: package 'plotly' was built under R version 4.0.4
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
setwd("C:/Users/noahz/Desktop/Data 110 R/Datasets/Hate crime data sets")
movies <- read_csv("movies.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## budget = col_double(),
## company = col_character(),
## country = col_character(),
## director = col_character(),
## genre = col_character(),
## gross = col_double(),
## name = col_character(),
## rating = col_character(),
## released = col_character(),
## runtime = col_double(),
## score = col_double(),
## star = col_character(),
## votes = col_double(),
## writer = col_character(),
## year = col_double()
## )
#view(movies)
This cleans up the release date to just a 4-digit year, because it was formatted YYYY-MM-DD, which I thought could be hard for the computer to work with.
#gsub("-..*", "", movies$released)
I looked at the dataset again and noticed that year is already a variable so cleaning up the release date is completely unnecessary. I will leave it in to show my process.
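For reference, here is a minimal sketch of how the year could be pulled out of a release date if the year column had not existed. The `released` vector below is a hypothetical sample, and the sketch assumes every value really is in YYYY-MM-DD form, which may not hold for every row of the real data.

```r
# Hypothetical sample of release dates in the YYYY-MM-DD format described above
released <- c("1986-08-22", "1987-06-11", "2016-03-04")

# Option 1: keep the first four characters and convert them to an integer
release_year <- as.integer(substr(released, 1, 4))

# Option 2: parse the text as a Date first, then format only the year
release_year_from_date <- as.integer(format(as.Date(released), "%Y"))

release_year
```

Either option gives a numeric year that can be used on a graph axis, whereas the commented-out `gsub()` call leaves the result as text.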
I decided to start experimenting with graphs and chose to make a streamgraph, as I hadn’t made one before. I started to have problems with the graph not showing up or being completely broken, so I used the variable “movies2” to reshape the data by genre in hopes of finding the problem. I ended up not needing it, so I just commented it out.
#movies2 <- movies %>%
#pivot_wider(names_from = "genre", values_from = "score")
#movies2
#colMeans(movies2[14:30], na.rm = TRUE)
After looking at past lecture notes, I realized that my problem was that my dataset was not formatted for a streamgraph. After a little testing, I was able to make a dataset that was formatted correctly, with one row per genre and year and a count of movies for each.
movies3 <- movies %>%
count(genre, year)
movies3
## # A tibble: 342 x 3
## genre year n
## <chr> <dbl> <int>
## 1 Action 1986 49
## 2 Action 1987 47
## 3 Action 1988 35
## 4 Action 1989 46
## 5 Action 1990 46
## 6 Action 1991 42
## 7 Action 1992 37
## 8 Action 1993 34
## 9 Action 1994 33
## 10 Action 1995 42
## # ... with 332 more rows
I made a simple streamgraph to see when different movie genres may have gained or lost popularity over the last 30 years.
streamgraph(movies3, key = "genre", value = "n", date = "year")
## Warning in widget_html(name = class(x)[1], package = attr(x, "package"), :
## streamgraph_html returned an object of class `list` instead of a `shiny.tag`.
That graph seemed weird, as it stayed a constant height over the 30 years, which would imply that the same number of movies came out every year. So I took a count of the number of movies that came out every year and, sure enough, almost every year had exactly 220 movies. This may have truly happened, but it could also be that the person collecting the data only collected about 220 movies from every year, which could bias which movies' data was collected.
movies_test <- movies %>%
count(year)
movies_test
## # A tibble: 31 x 2
## year n
## * <dbl> <int>
## 1 1986 220
## 2 1987 219
## 3 1988 220
## 4 1989 221
## 5 1990 220
## 6 1991 220
## 7 1992 220
## 8 1993 220
## 9 1994 220
## 10 1995 220
## # ... with 21 more rows
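One quick way to check how uniform the yearly counts are is to tabulate them. This is only a sketch: `movies_test_demo` is a hypothetical data frame shaped like `movies_test`, filled with the first five rows shown above.

```r
# Hypothetical per-year counts shaped like movies_test (columns: year, n)
movies_test_demo <- data.frame(
  year = 1986:1990,
  n    = c(220L, 219L, 220L, 221L, 220L)
)

# How many years share each count? One dominant value suggests a fixed cap.
count_frequency <- table(movies_test_demo$n)
count_frequency

# Smallest and largest yearly count; a very narrow range points the same way
count_range <- range(movies_test_demo$n)
count_range
```

Running the same two calls on the real `movies_test$n` would show whether all 31 years sit this close to 220.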
Next, I grouped the movies by country and year so that I could better compare the productions of different countries over time.
movies_avg <- movies %>%
group_by(country, year) %>%
summarise(gross = mean(gross))
## `summarise()` has grouped output by 'country'. You can override using the `.groups` argument.
#pivot_wider(names_from = "genre", values_from = "gross")
movies_avg
## # A tibble: 572 x 3
## # Groups: country [57]
## country year gross
## <chr> <dbl> <dbl>
## 1 Argentina 1986 725000
## 2 Argentina 1990 52148
## 3 Argentina 1992 100986
## 4 Argentina 1994 19710
## 5 Argentina 2000 1221261
## 6 Argentina 2001 363684
## 7 Argentina 2004 16756372
## 8 Argentina 2005 1430372.
## 9 Argentina 2007 46011
## 10 Argentina 2009 20167424
## # ... with 562 more rows
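Because each value above is an average, it can help to know how many movies are behind it. The sketch below is one possible variant of the summary, run on a hypothetical toy data frame (`movies_demo`) rather than the real data. The `n_movies` column, `na.rm = TRUE`, and `.groups = "drop"` are additions for illustration and are not part of the analysis above.

```r
library(dplyr)

# Hypothetical toy data with the same column names as `movies`
movies_demo <- data.frame(
  country = c("USA", "USA", "USA", "New Zealand"),
  year    = c(2003, 2003, 2003, 2003),
  gross   = c(50e6, 70e6, 60e6, 300e6)
)

movies_avg_demo <- movies_demo %>%
  group_by(country, year) %>%
  summarise(
    n_movies   = n(),                        # number of movies behind each average
    mean_gross = mean(gross, na.rm = TRUE),  # ignore missing gross values
    .groups    = "drop"                      # return an ungrouped result
  )

movies_avg_demo
```

In the toy data the New Zealand average rests on a single movie, so one big hit sets the whole value. Whether that is what happens in the real dataset would need to be checked with the real counts.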
Then I made another basic streamgraph that compares the average gross profit of movies made in different countries over time, and I noticed a few interesting things. For example, in one year the movies from New Zealand companies averaged over 300 million, whereas the US, which is where I think of as producing the most movies, only peaked at a little over 60 million. The US is very consistent with profit, though.
streamgraph(movies_avg, key = "country", value = "gross", date = "year")
## Warning in widget_html(name = class(x)[1], package = attr(x, "package"), :
## streamgraph_html returned an object of class `list` instead of a `shiny.tag`.
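Peaks like these are hard to read off a streamgraph, so they can also be looked up directly. This is a minimal sketch using `slice_max()` from dplyr on a hypothetical data frame (`avg_demo`) with made-up values; it is not the real output.

```r
library(dplyr)

# Hypothetical country-year averages shaped like movies_avg
avg_demo <- data.frame(
  country = c("USA", "USA", "New Zealand", "New Zealand"),
  year    = c(2001, 2002, 2001, 2003),
  gross   = c(55e6, 61e6, 2e6, 310e6)
)

# Keep the single highest-grossing year for each country
peak_by_country <- avg_demo %>%
  group_by(country) %>%
  slice_max(gross, n = 1) %>%
  ungroup()

peak_by_country
```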
While interesting, I found the previous graph too cluttered, so I pared it down to only 5 countries, some of which were consistent and some of which had more drastic fluctuations.
movies_avg2 <- movies %>%
group_by(country, year) %>%
summarise(gross = mean(gross))
## `summarise()` has grouped output by 'country'. You can override using the `.groups` argument.
movies_avg2$gross <- movies_avg2$gross/1000000
movies_avg2
## # A tibble: 572 x 3
## # Groups: country [57]
## country year gross
## <chr> <dbl> <dbl>
## 1 Argentina 1986 0.725
## 2 Argentina 1990 0.0521
## 3 Argentina 1992 0.101
## 4 Argentina 1994 0.0197
## 5 Argentina 2000 1.22
## 6 Argentina 2001 0.364
## 7 Argentina 2004 16.8
## 8 Argentina 2005 1.43
## 9 Argentina 2007 0.0460
## 10 Argentina 2009 20.2
## # ... with 562 more rows
movie_test <- movies_avg2 %>%
filter(country %in% c("USA", "UK", "New Zealand", "South Africa", "Canada"))
movie_test
## # A tibble: 115 x 3
## # Groups: country [5]
## country year gross
## <chr> <dbl> <dbl>
## 1 Canada 1986 3.01
## 2 Canada 1987 10.9
## 3 Canada 1988 10.4
## 4 Canada 1989 1.29
## 5 Canada 1990 6.04
## 6 Canada 1991 2.86
## 7 Canada 1992 0.610
## 8 Canada 1993 15.5
## 9 Canada 1994 11.4
## 10 Canada 1995 6.57
## # ... with 105 more rows
Then I graphed the newly pared-down data as a dotted line graph to get a better look at how the fluctuating countries compared to the more stable countries. I noticed that New Zealand regularly peaks above the other countries and that the USA is remarkably consistent, even when compared to the UK, which I thought looked fairly consistent on the previous graph.
ggplot(data = movie_test, aes(x = year, y = gross)) +
labs(title = "Average gross profit from movies for 5 countries") +
xlab("Year") +
ylab("Gross Profit (millions of dollars)") +
theme_dark(base_size = 13) +
geom_line(aes(color = country)) +
geom_point(aes(color = country)) +
scale_color_brewer(palette = "Set2")
I was not sure where to go with that graph, so I decided to explore something new. This graph checks whether there is any correlation between gross profit and budget. First I filtered the data down to only movies from 2011 and later to reduce the clutter on the graph, and I filtered out anything that grossed 400 million or more to remove some of the outliers and better focus on where the bulk of the data is.
filtered <- movies %>%
filter(year >= 2011 & gross < 400000000)
filtered$budget <- filtered$budget/1000000
filtered$gross <- filtered$gross/1000000
graph <- ggplot(filtered, aes(x = budget, y = gross)) +
geom_point()
graph2 <- graph + geom_smooth(method = "lm", formula = y ~ x)
ggplotly(graph2)
As can be seen from the linear regression with its confidence interval, there is a correlation between the gross profit of a movie and its budget. However, when I looked at just the graph without the regression, I thought that the correlation was stronger.
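The strength of the correlation can also be put into a number instead of being judged by eye. The sketch below uses a hypothetical data frame (`filtered_demo`) with made-up budget and gross values in millions of dollars, so the results say nothing about the real data. With the real `filtered` data, `cor()` may need `use = "complete.obs"` if any budgets or grosses are missing.

```r
# Hypothetical budget and gross values, in millions of dollars
filtered_demo <- data.frame(
  budget = c(5, 20, 40, 60, 100, 150),
  gross  = c(2, 35, 30, 90, 110, 260)
)

# Pearson correlation coefficient between budget and gross
budget_gross_cor <- cor(filtered_demo$budget, filtered_demo$gross)

# The same kind of straight-line model that geom_smooth(method = "lm") draws
fit <- lm(gross ~ budget, data = filtered_demo)
r_squared <- summary(fit)$r.squared

budget_gross_cor
r_squared
```

For a straight-line fit with one predictor, the R-squared value is the square of the correlation coefficient, so either number describes how tightly the points follow the line.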