First we bring in all the libraries we will be using. Then we load the data set we have downloaded.
#Load in Libraries
library(tidyr)
library(readr)
library(dplyr)
library(lubridate)
library(stringr)
library(janitor)
library(ggplot2)
library(scales)
#Load in the dataset
movies_raw <- read_csv("/Users/jus10segrest/Downloads/iu indy/stat for data science/movies.csv")
The next step for our data set is to clean it and format it so that we can begin to work through it.
# Create a new table, splitting the released column into release date and release country
movies_ <- movies_raw |>
  separate(released, into = c("release_new", "country_released"), sep = " \\(") |>
  mutate(country_released = str_remove(country_released, "\\)$")) |> # remove the closing parenthesis
  mutate(release_date = mdy(release_new)) |> # then parse the date text into a Date column
  rename(country_filmed = country) # rename column for ease of understanding
movies_
## # A tibble: 7,668 × 17
## name rating genre year release_new country_released score votes director
## <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <chr>
## 1 The Sh… R Drama 1980 June 13, 1… United States 8.4 9.27e5 Stanley…
## 2 The Bl… R Adve… 1980 July 2, 19… United States 5.8 6.5 e4 Randal …
## 3 Star W… PG Acti… 1980 June 20, 1… United States 8.7 1.20e6 Irvin K…
## 4 Airpla… PG Come… 1980 July 2, 19… United States 7.7 2.21e5 Jim Abr…
## 5 Caddys… R Come… 1980 July 25, 1… United States 7.3 1.08e5 Harold …
## 6 Friday… R Horr… 1980 May 9, 1980 United States 6.4 1.23e5 Sean S.…
## 7 The Bl… R Acti… 1980 June 20, 1… United States 7.9 1.88e5 John La…
## 8 Raging… R Biog… 1980 December 1… United States 8.2 3.30e5 Martin …
## 9 Superm… PG Acti… 1980 June 19, 1… United States 6.8 1.01e5 Richard…
## 10 The Lo… R Biog… 1980 May 16, 19… United States 7 1 e4 Walter …
## # ℹ 7,658 more rows
## # ℹ 8 more variables: writer <chr>, star <chr>, country_filmed <chr>,
## # budget <dbl>, gross <dbl>, company <chr>, runtime <dbl>,
## # release_date <date>
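As a minimal sketch of what the cleaning step does to a single value, the example below applies the same three operations to a small hand-made tibble. The tibble `toy_raw` and its movie names are made up for illustration; only the format of the `released` text follows the data set.

```r
library(tidyr)
library(dplyr)
library(stringr)
library(lubridate)

# Made-up rows that only mimic the format of the `released` column
toy_raw <- tibble(
  name     = c("Movie A", "Movie B"),
  released = c("June 13, 1980 (United States)", "May 9, 1980 (United States)")
)

toy_clean <- toy_raw |>
  # split the text at the " (" that precedes the country
  separate(released, into = c("release_new", "country_released"), sep = " \\(") |>
  mutate(
    # drop the trailing ")" left over from the split
    country_released = str_remove(country_released, "\\)$"),
    # parse "Month day, year" text into a Date
    release_date = mdy(release_new)
  )

toy_clean
```

Rows whose `released` text has no country in parentheses, or no full date, would end up with `NA` in the new columns, so it may be worth counting the missing values after this step.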
The first data frame I created groups the movies by rating and summarizes their IMDB scores, so that the average score of each rating in the data set can be compared.
rating_score <- movies_ |>
  group_by(rating) |>
  summarize(
    mean_score = mean(score, na.rm = TRUE),
    count = n()
  )
rating_score
## # A tibble: 13 × 3
## rating mean_score count
## <chr> <dbl> <int>
## 1 Approved 3.4 1
## 2 G 6.59 153
## 3 NC-17 6.55 23
## 4 Not Rated 6.92 283
## 5 PG 6.22 1252
## 6 PG-13 6.29 2112
## 7 R 6.45 3697
## 8 TV-14 6.3 1
## 9 TV-MA 7.02 9
## 10 TV-PG 6.94 5
## 11 Unrated 6.62 52
## 12 X 6.9 3
## 13 <NA> 6.63 77
smallest_rating <- rating_score |>
  filter(count == min(count))
smallest_rating
## # A tibble: 2 × 3
## rating mean_score count
## <chr> <dbl> <int>
## 1 Approved 3.4 1
## 2 TV-14 6.3 1
From this we see that the most common ratings are R (3,697), PG-13 (2,112), and PG (1,252), which is what we would expect, as these are the ratings attached to most regularly released movies. This is likely because they reach the widest audiences and have become the standard ratings.
The lowest counts belong to the TV-14 and Approved ratings, with one movie each. I chose to ignore the TV ratings as they do not pertain to what I want out of this data set, but the Approved rating is interesting: it is an old rating that predates the modern MPAA rating system introduced in 1968, under which ratings such as PG, PG-13, and R became the standard. The movie “Tarzan The Ape Man” has this rating, which doesn’t make much sense as it was released in 1981, but this is also the rating attached to it on IMDB. This could be an interesting investigation in the future.
My hypothesis for why some ratings have fewer movies than others is viewership and, in part, gross revenue. This could be researched further by comparing the gross revenue of each rating to see whether the two variables are related. The presumed hypothesis would be that R, PG-13, and PG movies tend to gross more, while ratings such as Approved and NC-17 tend to gross less.
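One way this follow-up could start is sketched below: summarize gross revenue per rating and sort the result. The helper `gross_by_rating()` and the numbers in `toy_movies` are hypothetical and exist only to show the shape of the output; the column names `rating` and `gross` match the data set.

```r
library(dplyr)

# Hypothetical helper: summarizes gross revenue for each rating
gross_by_rating <- function(movies) {
  movies |>
    filter(!is.na(rating)) |>
    group_by(rating) |>
    summarize(
      median_gross = median(gross, na.rm = TRUE),
      mean_gross   = mean(gross, na.rm = TRUE),
      count        = n()
    ) |>
    arrange(desc(median_gross))
}

# Made-up numbers, used only to demonstrate the helper
toy_movies <- tibble(
  rating = c("R", "R", "PG-13", "PG-13", "NC-17"),
  gross  = c(10e6, 30e6, 50e6, 70e6, 2e6)
)

gross_by_rating(toy_movies)
```

The median is reported next to the mean because gross revenue is usually skewed by a few very large releases. On the real data the call would be `gross_by_rating(movies_)`.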
Visualization for this grouping
ggplot(rating_score, mapping = aes(x = rating, y = mean_score)) +
  geom_col(fill = "black") + # same result as geom_bar(stat = "identity")
  labs(
    title = "Bar Graph of Movie Rating against their average Score",
    x = "Movie Rating",
    y = "Mean IMDB Score"
  ) +
  theme_minimal()
From this bar graph we can see that the mean score is fairly similar across ratings, with the exception of the Approved rating, most likely because only one movie is associated with it. Mean scores settle around 6-7, which may be the range people tend to give movies they feel neutral about.
The second data frame I created groups the movies by genre and summarizes their runtime, so that the average run time of each genre in the data set can be compared.
genre_runtime <- movies_ |>
  group_by(genre) |>
  summarize(
    mean_runtime = mean(runtime, na.rm = TRUE),
    count = n()
  )
genre_runtime
## # A tibble: 19 × 3
## genre mean_runtime count
## <chr> <dbl> <int>
## 1 Action 110. 1705
## 2 Adventure 108. 427
## 3 Animation 92.2 338
## 4 Biography 120. 443
## 5 Comedy 101. 2245
## 6 Crime 112. 551
## 7 Drama 113. 1518
## 8 Family 99.9 11
## 9 Fantasy 99.4 44
## 10 History 55 1
## 11 Horror 96.3 322
## 12 Music 117 1
## 13 Musical 145 2
## 14 Mystery 116. 20
## 15 Romance 107. 10
## 16 Sci-Fi 100. 10
## 17 Sport 94 1
## 18 Thriller 98.6 16
## 19 Western 97.3 3
smallest_genre <- genre_runtime |>
  filter(count == min(count))
smallest_genre
## # A tibble: 3 × 3
## genre mean_runtime count
## <chr> <dbl> <int>
## 1 History 55 1
## 2 Music 117 1
## 3 Sport 94 1
From these groupings we can see that Comedy, Action, and Drama are the most common genres in the data set. Among genres with a substantial number of movies, Biographies have the longest average run time while Animation movies have the shortest. This aligns with what I would have expected, as animation costs are very high, which would lead to shorter movies to save budget. Biographies would tend to be longer as they have real-life source material to work from and most likely a lot of information to cover.
Movies with the genre tags Sport, Music, and History have the lowest counts, with one movie each. This ties into my hypothesis that movie genre is related to gross revenue, which would mean these genres are less likely to turn a profit and therefore less likely to be produced.
ggplot(genre_runtime, mapping = aes(x = genre, y = mean_runtime)) +
  geom_col(fill = "black") + # same result as geom_bar(stat = "identity")
  labs(
    title = "Bar Graph of Movie Genre against their average Runtimes",
    x = "Movie Genre",
    y = "Mean Runtime (minutes)"
  ) +
  theme_minimal() +
  coord_flip()
This bar graph shows the average run time for each genre. Excluding genres with low counts such as Musical and History, it is interesting to see the averages stay within roughly 90-120 minutes. This could be because producers have found that these run times attract more viewers, as the movie is less of a commitment.
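To check the range once the low-count genres are removed, the sketch below re-enters the rounded values printed in the `genre_runtime` output and keeps only genres with at least 100 movies. The cutoff `min_count` is an assumption chosen for illustration, not a value used in the analysis.

```r
library(dplyr)

# Rounded values copied from the genre_runtime output above
genre_runtime_demo <- tibble(
  genre = c("Action", "Adventure", "Animation", "Biography", "Comedy",
            "Crime", "Drama", "Family", "Fantasy", "History", "Horror",
            "Music", "Musical", "Mystery", "Romance", "Sci-Fi", "Sport",
            "Thriller", "Western"),
  mean_runtime = c(110, 108, 92.2, 120, 101, 112, 113, 99.9, 99.4, 55, 96.3,
                   117, 145, 116, 107, 100, 94, 98.6, 97.3),
  count = c(1705, 427, 338, 443, 2245, 551, 1518, 11, 44, 1, 322,
            1, 2, 20, 10, 10, 1, 16, 3)
)

min_count <- 100 # assumed cutoff for a "common" genre

common_genres <- genre_runtime_demo |>
  filter(count >= min_count) |>
  arrange(mean_runtime)

# smallest and largest mean runtime among the common genres
range(common_genres$mean_runtime)
```

With this cutoff eight genres remain, and their rounded means run from 92.2 minutes (Animation) to 120 minutes (Biography).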
The third data frame I created groups the movies by the country they were filmed in and summarizes their budget, so that the average budget can be compared across countries.
filmed_budget <- movies_ |>
  group_by(country_filmed) |>
  summarize(
    mean_budget = mean(budget, na.rm = TRUE),
    count = n()
  )
filmed_budget
## # A tibble: 60 × 3
## country_filmed mean_budget count
## <chr> <dbl> <int>
## 1 Argentina 1850000 8
## 2 Aruba 30000000 1
## 3 Australia 25736905. 92
## 4 Austria 19450000 5
## 5 Belgium 27025000 8
## 6 Brazil 10633333. 6
## 7 Canada 22522129. 190
## 8 Chile 26000000 2
## 9 China 71015385. 40
## 10 Colombia 3000000 1
## # ℹ 50 more rows
smallest_filmed <- filmed_budget |>
  filter(count == min(count))
smallest_filmed
## # A tibble: 11 × 3
## country_filmed mean_budget count
## <chr> <dbl> <int>
## 1 Aruba 30000000 1
## 2 Colombia 3000000 1
## 3 Jamaica 3000000 1
## 4 Kenya 20000000 1
## 5 Lebanon 4000000 1
## 6 Libya 35000000 1
## 7 Malta 55000000 1
## 8 Panama 20000000 1
## 9 Republic of Macedonia 1900000 1
## 10 Romania NaN 1
## 11 Serbia NaN 1
From these groupings we can see that there are 59 countries in the data set, 11 of which have only one movie attributed to them. One possible explanation is that filming in these locations poses obstacles that are less severe in other countries, resulting in fewer movies being made there. These could include restrictions on filming, travel costs, unsuitable terrain, and many other reasons. My hypothesis is that the mean budget for these locations is higher than the overall mean budget across all countries, which would lead to fewer movies being made there.
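A possible first test of this hypothesis is sketched below: label each movie by whether its country appears only once, then compare the mean budget of the two groups. The helper `compare_single_movie_countries()` and the values in `toy_movies` are hypothetical; the column names `country_filmed` and `budget` match the cleaned data set.

```r
library(dplyr)

# Hypothetical helper: compares budgets of movies from single-movie countries
# with budgets of movies from all other countries
compare_single_movie_countries <- function(movies) {
  movies |>
    filter(!is.na(country_filmed)) |>
    # attach the number of movies per country to every row
    add_count(country_filmed, name = "country_count") |>
    mutate(
      group = if_else(country_count == 1, "single-movie country", "other country")
    ) |>
    group_by(group) |>
    summarize(
      mean_budget = mean(budget, na.rm = TRUE),
      movies      = n()
    )
}

# Made-up values, used only to demonstrate the helper
toy_movies <- tibble(
  country_filmed = c("A", "A", "A", "B", "C"),
  budget         = c(10e6, 20e6, 30e6, 5e6, NA)
)

compare_single_movie_countries(toy_movies)
```

Movies with a missing budget still count toward `movies` but are dropped from the mean, which matters here because two of the single-movie countries (Romania and Serbia) have no budget recorded.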
ggplot(filmed_budget, mapping = aes(x = country_filmed, y = mean_budget)) +
  geom_col(fill = "black") + # same result as geom_bar(stat = "identity")
  scale_y_continuous(labels = label_dollar()) +
  labs(
    title = "Country Filmed against their average Budgets",
    x = "Country Filmed",
    y = "Mean Budget (dollars)"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 0.8, size = 4))
This bar graph shows the mean budget for movies in the data set by the country they were filmed in. We can see that the two countries with the highest mean budgets are China (40 movies in the data set) and Finland (3), while the lowest include Iran (10) and Argentina (8), among others. It could be interesting to investigate why certain countries have such high mean budgets even though they do not produce many more movies than other countries.
The two categorical variables I chose were genre and rating. I thought this would provide some interesting insights into which genres tend to lean towards more adult audiences and vice versa.
genre_rating <- movies_ |>
  tabyl(genre, rating)
genre_rating
##     genre Approved   G NC-17 Not Rated  PG PG-13   R TV-14 TV-MA TV-PG Unrated X NA_
##    Action        0   1     0        49 181   623 843     1     1     0       0 0   6
## Adventure        1  21     0         5 205    91 102     0     1     0       0 0   1
## Animation        0 100     0        10 185    23  13     0     2     2       0 0   3
## Biography        0   1     1        12  65   135 223     0     1     0       2 0   3
##    Comedy        0  16     4        47 428   738 984     0     1     1      14 0  12
##     Crime        0   0     3        32   7    46 447     0     0     0       9 1   6
##     Drama        0  11    14       113 155   390 767     0     3     2      27 1  35
##    Family        0   3     0         0   7     0   0     0     0     0       0 0   1
##   Fantasy        0   0     0         2   3     9  29     0     0     0       0 0   1
##   History        0   0     0         0   0     0   0     0     0     0       0 0   1
##    Horror        0   0     1         9   7    44 256     0     0     0       0 1   4
##     Music        0   0     0         0   0     0   1     0     0     0       0 0   0
##   Musical        0   0     0         1   0     0   0     0     0     0       0 0   1
##   Mystery        0   0     0         1   0     4  15     0     0     0       0 0   0
##   Romance        0   0     0         1   3     2   2     0     0     0       0 0   2
##    Sci-Fi        0   0     0         1   2     3   3     0     0     0       0 0   1
##     Sport        0   0     0         0   0     1   0     0     0     0       0 0   0
##  Thriller        0   0     0         0   3     3  10     0     0     0       0 0   0
##   Western        0   0     0         0   1     0   2     0     0     0       0 0   0
Here we have a full table of every genre against every rating in the data set. Many combinations do not occur in the data. For example, the Action genre has no movies rated Approved, NC-17, TV-PG, Unrated, or X. There could be many reasons for these gaps, but the two most likely in my opinion are that the data set simply does not contain the movies with that combination, or that those genres and ratings rarely go together. Unrated movies are often unrated because of extreme content, so it makes sense that far fewer of them are available. The same reasoning extends to the other missing combinations.
I am going to focus on the most common combinations, as I think they are clearer in this data set and more can be learned from them. Comedy with the R rating is the most common combination, with 984 movies, followed by Action with R (843), and I think this makes sense. Many comedies rely on dark jokes, crude humor, and adult language to get laughs, so they are rated for older audiences.
genre_rating <- movies_ |>
  count(genre, rating)
ggplot(genre_rating, aes(x = genre, y = rating, fill = n)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "white", high = "black") +
  theme_minimal() +
  theme(
    panel.grid = element_blank(),
    axis.text.x = element_text(angle = 45, hjust = 1, size = 8)
  ) +
  labs(title = "Movie Count by Genre and Rating", fill = "Count")
I decided to create a heat map to visualize these combinations, as it shows how many combinations are missing as well as how strongly movies are concentrated in certain combinations. It is clear that the R, PG-13, and PG ratings have the most movies, while most other ratings do not show any concentration.
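Because `count()` only returns combinations that occur at least once, the missing ones are absent from the plotted data and not stored as zeros. A minimal sketch of how they could be made explicit with `tidyr::complete()` follows; `toy_movies` is a made-up stand-in for `movies_`, and the names `genre_rating_full` and `missing_combinations` are introduced here for illustration.

```r
library(dplyr)
library(tidyr)

# Made-up stand-in for the movies_ table
toy_movies <- tibble(
  genre  = c("Action", "Action", "Comedy", "Horror"),
  rating = c("R", "PG-13", "R", "R")
)

genre_rating_full <- toy_movies |>
  count(genre, rating) |>
  # add every genre/rating pair that never occurs, with a count of 0
  complete(genre, rating, fill = list(n = 0L))

# the combinations that do not exist in the data
missing_combinations <- genre_rating_full |>
  filter(n == 0)

missing_combinations
```

Passing a completed table like `genre_rating_full` to the heat map would draw a tile for every combination, and `nrow(missing_combinations)` gives the number of empty cells directly.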