Setting up R and Loading Data set

First we bring in all the libraries we will be using. Then we load the data set we have downloaded.

#Load in Libraries
library(tidyr)
library(readr)
library(dplyr)
library(lubridate)
library(stringr)
library(janitor)
library(ggplot2)
library(scales)

#Load in the dataset
movies_raw <- read_csv("/Users/jus10segrest/Downloads/iu indy/stat for data science/movies.csv")

The next step for our data set is to clean it and format it so that we can begin to work through it.

# Create a new table, splitting the released column into release date and release country
movies_ <- movies_raw |>
  separate(released, into = c("release_new", "country_released"), sep = " \\(") |>
  mutate(country_released = str_remove(country_released, "\\)$")) |>    # remove the closing parenthesis
  mutate(release_date = mdy(release_new)) |>         # parse the date text into a Date column
  rename(country_filmed = country)            # rename column for ease of understanding
  
movies_
## # A tibble: 7,668 × 17
##    name    rating genre  year release_new country_released score  votes director
##    <chr>   <chr>  <chr> <dbl> <chr>       <chr>            <dbl>  <dbl> <chr>   
##  1 The Sh… R      Drama  1980 June 13, 1… United States      8.4 9.27e5 Stanley…
##  2 The Bl… R      Adve…  1980 July 2, 19… United States      5.8 6.5 e4 Randal …
##  3 Star W… PG     Acti…  1980 June 20, 1… United States      8.7 1.20e6 Irvin K…
##  4 Airpla… PG     Come…  1980 July 2, 19… United States      7.7 2.21e5 Jim Abr…
##  5 Caddys… R      Come…  1980 July 25, 1… United States      7.3 1.08e5 Harold …
##  6 Friday… R      Horr…  1980 May 9, 1980 United States      6.4 1.23e5 Sean S.…
##  7 The Bl… R      Acti…  1980 June 20, 1… United States      7.9 1.88e5 John La…
##  8 Raging… R      Biog…  1980 December 1… United States      8.2 3.30e5 Martin …
##  9 Superm… PG     Acti…  1980 June 19, 1… United States      6.8 1.01e5 Richard…
## 10 The Lo… R      Biog…  1980 May 16, 19… United States      7   1   e4 Walter …
## # ℹ 7,658 more rows
## # ℹ 8 more variables: writer <chr>, star <chr>, country_filmed <chr>,
## #   budget <dbl>, gross <dbl>, company <chr>, runtime <dbl>,
## #   release_date <date>
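To make the cleaning step easier to follow, here is a minimal sketch of the same split-and-parse logic on a two-row toy table. The table name toy_released and the second row are hypothetical; only the shape of the text ("Month day, year (Country)") is taken from the released column.

```r
library(tidyr)
library(dplyr)
library(stringr)
library(lubridate)

# Toy input in the same shape as the released column.
# The second row is a made-up value for illustration only.
toy_released <- tibble(
  released = c("May 9, 1980 (United States)", "July 4, 1990 (Canada)")
)

toy_clean <- toy_released |>
  # Split at " (" so the date text and the country land in separate columns
  separate(released, into = c("release_new", "country_released"), sep = " \\(") |>
  mutate(
    # Drop the closing parenthesis left at the end of the country
    country_released = str_remove(country_released, "\\)$"),
    # Parse "Month day, year" text into a Date
    release_date = mdy(release_new)
  )

toy_clean
```

If a released value does not follow this pattern, mdy() returns NA for that row, so it is worth checking sum(is.na(movies_$release_date)) after cleaning.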

Group_By Data Frames

First Group

The first data frame I created groups the movies by rating and summarizes their IMDb scores. It compares the average score of each rating in the data set against the others.

rating_score <- movies_ |>
  group_by(rating) |>
  summarize(
    mean_score = mean(score, na.rm = TRUE),
    count=n()
  )
rating_score
## # A tibble: 13 × 3
##    rating    mean_score count
##    <chr>          <dbl> <int>
##  1 Approved        3.4      1
##  2 G               6.59   153
##  3 NC-17           6.55    23
##  4 Not Rated       6.92   283
##  5 PG              6.22  1252
##  6 PG-13           6.29  2112
##  7 R               6.45  3697
##  8 TV-14           6.3      1
##  9 TV-MA           7.02     9
## 10 TV-PG           6.94     5
## 11 Unrated         6.62    52
## 12 X               6.9      3
## 13 <NA>            6.63    77
smallest_rating <- rating_score |>
  filter(count == min(count))
smallest_rating
## # A tibble: 2 × 3
##   rating   mean_score count
##   <chr>         <dbl> <int>
## 1 Approved        3.4     1
## 2 TV-14           6.3     1

From this we see that the most common ratings are R (3697), PG-13 (2112), and PG (1252). This is what we would expect, since in the real world these ratings are attached to most of the movies released each year. They reach the largest audiences and have become the standard ratings.

The lowest counts belong to the TV-14 and Approved ratings, with one movie each. I chose to ignore the TV ratings as they do not pertain to what I want out of this data set, but the Approved rating is interesting: it is an old rating from the era before the modern rating system replaced it in the late 1960s and later brought in ratings such as PG, PG-13, and R. The movie “Tarzan The Ape Man” has this rating, which doesn’t make much sense as it was released in 1981, but IMDb attaches the same rating to it. This could be an interesting investigation in the future.

My hypothesis for why some ratings have fewer movies than others is viewership and, in part, gross revenue. This could be researched further by comparing the gross revenue of each rating to see whether the two variables are related. The presumed hypothesis would be that R, PG-13, and PG movies tend to earn more gross revenue, while ratings such as Approved and NC-17 tend to earn less.
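A first check of this hypothesis could look like the sketch below. It is only an outline: toy_movies and every value in it are made up for illustration, and on the real data the same pipeline would start from movies_ instead. The median is used here rather than the mean as an assumption, since revenue figures are usually skewed by a few very large hits.

```r
library(dplyr)

# Hypothetical toy rows standing in for movies_ (all values are made up)
toy_movies <- tibble(
  rating = c("R", "R", "PG-13", "PG-13", "PG", "NC-17"),
  gross  = c(50e6, 30e6, 120e6, NA, 80e6, 2e6)
)

rating_gross <- toy_movies |>
  group_by(rating) |>
  summarize(
    median_gross = median(gross, na.rm = TRUE),  # ignore missing revenue
    count = n()                                  # group size, to spot tiny groups
  ) |>
  arrange(desc(median_gross))                    # highest-earning ratings first

rating_gross
```

Keeping count next to the summary makes it easy to see which ratings rest on too few movies to support any conclusion.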

Visualization for this grouping

ggplot(rating_score, mapping = aes(x = rating, y = mean_score)) +
  geom_col(fill = "black") +   # geom_col() plots the values as given, like stat = "identity"
  labs(
    title = "Mean IMDb Score by Movie Rating",
    x = "Movie Rating",
    y = "Mean IMDb Score"
  ) +
  theme_minimal()

From this bar graph we can see that the mean score for each rating is fairly similar. The exception is the Approved rating, most likely because only one movie is associated with it. The mean scores cluster between 6 and 7, which may be the range people use for movies they feel neutral about.
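Because a mean based on a single movie says little about a rating, one option is to drop small groups before plotting. The sketch below uses a few rows copied from the rating_score output above; the name rating_score_toy and the cutoff of 30 movies are assumptions made for illustration, not values from the analysis.

```r
library(dplyr)
library(ggplot2)

# A few rows copied from the rating_score output
rating_score_toy <- tibble(
  rating     = c("Approved", "G", "PG", "R", "TV-14"),
  mean_score = c(3.4, 6.59, 6.22, 6.45, 6.3),
  count      = c(1, 153, 1252, 3697, 1)
)

min_count <- 30  # hypothetical cutoff; choose what suits the question

# Keep only ratings with enough movies for the mean to be meaningful
rating_score_filtered <- rating_score_toy |>
  filter(count >= min_count)

# Same bar chart as before, built from the filtered table
p <- ggplot(rating_score_filtered, aes(x = rating, y = mean_score)) +
  geom_col(fill = "black") +
  labs(x = "Movie Rating", y = "Mean IMDb Score") +
  theme_minimal()
```

Calling print(p) draws the chart. With this cutoff the Approved and TV-14 bars disappear, so the remaining bars can be compared on equal footing.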

Second Group

The second data frame I created groups the movies by genre and summarizes their run time. It compares the average run time of each movie genre in the data set.

genre_runtime <- movies_ |>
  group_by(genre) |>
  summarize(
    mean_runtime = mean(runtime, na.rm = TRUE),
    count=n()
  )
genre_runtime
## # A tibble: 19 × 3
##    genre     mean_runtime count
##    <chr>            <dbl> <int>
##  1 Action           110.   1705
##  2 Adventure        108.    427
##  3 Animation         92.2   338
##  4 Biography        120.    443
##  5 Comedy           101.   2245
##  6 Crime            112.    551
##  7 Drama            113.   1518
##  8 Family            99.9    11
##  9 Fantasy           99.4    44
## 10 History           55       1
## 11 Horror            96.3   322
## 12 Music            117       1
## 13 Musical          145       2
## 14 Mystery          116.     20
## 15 Romance          107.     10
## 16 Sci-Fi           100.     10
## 17 Sport             94       1
## 18 Thriller          98.6    16
## 19 Western           97.3     3
smallest_genre <- genre_runtime |>
  filter(count == min(count))
smallest_genre
## # A tibble: 3 × 3
##   genre   mean_runtime count
##   <chr>          <dbl> <int>
## 1 History           55     1
## 2 Music            117     1
## 3 Sport             94     1

From these groupings we can see that Comedy, Action, and Drama are the most common genres in the data set. Among genres with more than a handful of movies, Biographies have the highest average run time (about 120 minutes) while Animation movies have the shortest (about 92 minutes). Musical (145 minutes) and History (55 minutes) are more extreme, but they are based on only two movies and one movie respectively. This aligns with what I would have thought, as animation costs are very high, which would lead to shorter movies to save budget. Biographies would tend to be longer as they have real-life source material to work from and most likely a lot of information to cover.

Movies with the genre tags of Sport, Music, and History have the lowest counts, with one movie each. This leads directly to my hypothesis: I believe that movie genre is related to gross revenue, which would mean these specific genres are less likely to turn a profit and therefore less likely to be produced.

ggplot(genre_runtime, mapping = aes(x = genre, y = mean_runtime)) +
  geom_col(fill = "black") +   # geom_col() plots the values as given, like stat = "identity"
  labs(
    title = "Mean Runtime by Movie Genre",
    x = "Movie Genre",
    y = "Mean Runtime (minutes)"
  ) +
  theme_minimal() +
  coord_flip()

This bar graph shows the average run time for each genre. Excluding genres with low counts such as Musical and History, it is interesting to see that the averages all stay within roughly 90 to 120 minutes. This could be because movie producers have found that these run times lead more people to check out a movie, as it is less of a commitment.

Third Group

The third data frame I created groups the movies by country filmed and summarizes their budget. It compares the mean budget of the movies filmed in each country.

filmed_budget <- movies_ |>
  group_by(country_filmed) |>
  summarize(
    mean_budget = mean(budget, na.rm = TRUE),
    count=n()
  )
filmed_budget
## # A tibble: 60 × 3
##    country_filmed mean_budget count
##    <chr>                <dbl> <int>
##  1 Argentina         1850000      8
##  2 Aruba            30000000      1
##  3 Australia        25736905.    92
##  4 Austria          19450000      5
##  5 Belgium          27025000      8
##  6 Brazil           10633333.     6
##  7 Canada           22522129.   190
##  8 Chile            26000000      2
##  9 China            71015385.    40
## 10 Colombia          3000000      1
## # ℹ 50 more rows
smallest_filmed <- filmed_budget |>
  filter(count == min(count))
smallest_filmed
## # A tibble: 11 × 3
##    country_filmed        mean_budget count
##    <chr>                       <dbl> <int>
##  1 Aruba                    30000000     1
##  2 Colombia                  3000000     1
##  3 Jamaica                   3000000     1
##  4 Kenya                    20000000     1
##  5 Lebanon                   4000000     1
##  6 Libya                    35000000     1
##  7 Malta                    55000000     1
##  8 Panama                   20000000     1
##  9 Republic of Macedonia     1900000     1
## 10 Romania                       NaN     1
## 11 Serbia                        NaN     1

The summary table has 60 rows, one per value of country_filmed, and 11 of those countries have only one movie attributed to them. One explanation could be that these locations create obstacles that are less severe in other countries, resulting in fewer movies being made there. This could be because of restrictions on filming, travel costs, terrain that is not well suited to filming, and many other reasons. My hypothesis is that the mean budget for these locations is higher than the mean budget across all countries, which would lead to fewer movies being made in those locations.
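The hypothesis about single-movie countries can be checked with one more grouping step. The sketch below uses a handful of rows copied from the filmed_budget output (rounded as printed); the names filmed_budget_toy, single_movie, and budget_by_group are hypothetical. On the real data the same pipeline would start from filmed_budget.

```r
library(dplyr)

# A few rows copied from the filmed_budget output (values as printed)
filmed_budget_toy <- tibble(
  country_filmed = c("Aruba", "Colombia", "Romania", "Australia", "Canada", "China"),
  mean_budget    = c(30000000, 3000000, NaN, 25736905, 22522129, 71015385),
  count          = c(1, 1, 1, 92, 190, 40)
)

budget_by_group <- filmed_budget_toy |>
  # Flag countries that have exactly one movie in the data set
  mutate(single_movie = count == 1) |>
  group_by(single_movie) |>
  summarize(
    # na.rm = TRUE also drops NaN, i.e. countries with no recorded budget
    avg_mean_budget = mean(mean_budget, na.rm = TRUE),
    countries = n()
  )

budget_by_group
```

This averages the per-country means, so every country counts equally regardless of how many movies it has. Comparing the budgets movie by movie from movies_ would weight the larger countries more heavily and could give a different answer.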

ggplot(filmed_budget, mapping = aes(x = country_filmed, y = mean_budget)) +
  geom_col(fill = "black") +   # geom_col() plots the values as given, like stat = "identity"
  scale_y_continuous(labels = label_dollar()) +
  labs(
    title = "Mean Budget by Country Filmed",
    x = "Country Filmed",
    y = "Mean Budget (dollars)"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 0.8, size = 4))

This bar graph shows the mean budget for movies in the data set by the country they were filmed in. The two countries with the highest mean budgets are China (40 movies in the data set) and Finland (3), while the countries with the lowest include Iran (10) and Argentina (8), among others. It could be an interesting experiment to find out why certain countries have such high budgets compared to others even though they do not produce many more movies.

Categorical Variable Combinations

The two categorical variables I chose were genre and rating. I thought this would provide some interesting insights into which genres tend to lean towards more adult audiences and vice versa.

genre_rating <- movies_ |>
  tabyl(genre, rating)
genre_rating
##      genre Approved   G NC-17 Not Rated  PG PG-13   R TV-14 TV-MA TV-PG Unrated X NA_
##     Action        0   1     0        49 181   623 843     1     1     0       0 0   6
##  Adventure        1  21     0         5 205    91 102     0     1     0       0 0   1
##  Animation        0 100     0        10 185    23  13     0     2     2       0 0   3
##  Biography        0   1     1        12  65   135 223     0     1     0       2 0   3
##     Comedy        0  16     4        47 428   738 984     0     1     1      14 0  12
##      Crime        0   0     3        32   7    46 447     0     0     0       9 1   6
##      Drama        0  11    14       113 155   390 767     0     3     2      27 1  35
##     Family        0   3     0         0   7     0   0     0     0     0       0 0   1
##    Fantasy        0   0     0         2   3     9  29     0     0     0       0 0   1
##    History        0   0     0         0   0     0   0     0     0     0       0 0   1
##     Horror        0   0     1         9   7    44 256     0     0     0       0 1   4
##      Music        0   0     0         0   0     0   1     0     0     0       0 0   0
##    Musical        0   0     0         1   0     0   0     0     0     0       0 0   1
##    Mystery        0   0     0         1   0     4  15     0     0     0       0 0   0
##    Romance        0   0     0         1   3     2   2     0     0     0       0 0   2
##     Sci-Fi        0   0     0         1   2     3   3     0     0     0       0 0   1
##      Sport        0   0     0         0   0     1   0     0     0     0       0 0   0
##   Thriller        0   0     0         0   3     3  10     0     0     0       0 0   0
##    Western        0   0     0         0   1     0   2     0     0     0       0 0   0

Here we have a full table of all the different genres against all the different ratings in the data set. Many combinations of these two variables do not exist in the data. For example, the Action genre has no movies rated Approved, NC-17, TV-PG, Unrated, or X. I think there are many reasons for these missing combinations, but the two biggest in my opinion are that the data set simply doesn’t contain the movies with that combination, or that those genres and ratings don’t mix. Movies are usually left unrated because their content is very extreme, so it makes sense that far fewer of them are available. The same reasoning extends to the other missing combinations.
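The missing combinations can also be listed directly rather than read off the table. The sketch below is one way to do it, using tidyr's complete() to add a row with a count of zero for every genre and rating pair that never occurs. The table toy_pairs is a made-up stand-in for movies_, and the names combos and missing_combos are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy rows standing in for movies_
toy_pairs <- tibble(
  genre  = c("Action", "Action", "Comedy", "Crime"),
  rating = c("R", "PG-13", "R", "R")
)

combos <- toy_pairs |>
  count(genre, rating) |>
  # Add every genre/rating pair that is absent, with a count of 0
  complete(genre, rating, fill = list(n = 0))

# Pairs that never occur in the data
missing_combos <- combos |>
  filter(n == 0)

missing_combos
```

In this toy table, Comedy and Crime have no PG-13 movies, so those two pairs are returned. The completed table could also feed a heat map so that empty combinations are drawn as explicit zero tiles instead of gaps.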

I am going to focus on the most common combinations, as they are clearer in this data set and offer more to take away. Comedy with an R rating is the most common combination, with 984 movies, followed by Action with an R rating at 843. I think this makes a lot of sense: many comedies rely on dark jokes, crude humor, and adult language to get laughs, so they need to be rated for older audiences.

genre_rating <- movies_ |>
  count(genre, rating)

ggplot(genre_rating, aes(x = genre, y = rating, fill = n)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "white", high = "black") +
  theme_minimal() + 
  theme(panel.grid = element_blank(),
    axis.text.x = element_text(angle = 45, hjust = 1, size = 8) 
  ) +
  labs(title = "Movie Count by Genre and Rating", fill = "Count")

I decided to create a heat map to visualize these combinations, as it shows how many combinations are missing as well as how concentrated the movies are in certain combinations. It is clear that the R, PG-13, and PG ratings have the most movies, while most other ratings show little concentration in any genre.