library(dplyr)
library(ggplot2)
library(plotly)
data <- read.csv("netflix1.csv")
movies <- data %>% filter(type == "Movie")
movies$duration <- as.numeric(gsub("min", "", movies$duration))

Base R

Investigation

Many countries produce movies that are available on Netflix, and I want to figure out which of these are most prominent. To display this I am going to use a bar chart, since the data pairs countries (categorical) with their counts (numerical).

only_countries <- unlist(strsplit(as.character(movies$country), ", "))
only_countries <- only_countries[only_countries != "Not Given"]
country_counts <- sort(table(only_countries), decreasing = TRUE)
top_countries <- country_counts[1:10]
# Bar chart of the ten countries with the most films; las = 2 rotates the labels
p1 <- barplot(top_countries,
              las = 2,
              main = "Top 10 Countries by Number of Netflix Films",
              xlab = "Country",
              ylab = "Number of Films",
              cex.names = 0.7,
              col = "lightblue"
)

Discussion

From the graph, I could see that the most prominent country by number of films is the United States, by a wide margin. As a result, I will use only films from the United States for the rest of my charts. The graph also shows that most countries do not have many movies on Netflix, at least in this dataset. Where the dataset was scraped from likely affects which countries are most common. Limiting the data to the United States could also help eliminate many actors and directors who rarely appear in other films hosted on the site. This should decrease the overall size of the data, which would be beneficial.
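
To put a number on "by a wide margin", the counts can be converted to percentage shares. The following is only a minimal sketch: `toy_counts` and its values are made up for illustration and are not taken from the dataset. In the report, `country_counts` would take its place.

```r
# Hypothetical toy counts, for illustration only (not real Netflix figures).
# In the report, country_counts from the code above would be used instead.
toy_counts <- c("United States" = 50, "India" = 20, "United Kingdom" = 10)

# Share of each country as a percentage of all counted films
country_share <- round(100 * toy_counts / sum(toy_counts), 1)
print(country_share)
```

Comparing the first share with the second gives a quick sense of how dominant the leading country is.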

ggplot2

Plot 1

Investigation

The question I am trying to answer with the next graph is, “Have movies changed in length over the years?” This question prompts several others. For instance, has increased accessibility led to more short indie films? Have recent long blockbusters such as Wicked or Dune led to new trends, or are they themselves part of a trend? These are the kinds of questions I hope this graph will help answer.

USMovies <- movies %>% filter(country=="United States")
USMovies$Genre <- gsub(",.*", "", USMovies$listed_in)

year_avg <- USMovies %>%
  group_by(release_year) %>%
  summarise(avg_duration = mean(duration, na.rm = TRUE))

ggplot(year_avg, aes(x = release_year, y = avg_duration)) +
  geom_point(
    data = USMovies %>% filter(duration <= 220),
    aes(x = release_year, y = duration),
    color = "gray60",
    alpha = 0.2,
    size = 1.8
  ) +
  geom_line(color = "darkcyan", linewidth = 1.5) +
  labs(
    title = "Average Movie Duration vs. Release Year",
    x = "Release Year",
    y = "Avg. Movie Duration (Minutes)"
  ) +
  geom_text(
    data = year_avg %>% filter(avg_duration == max(avg_duration)),
    aes(label = round(avg_duration, 1)),
    hjust = -0.5
  ) +
  theme_minimal()

Discussion

The most surprising thing I took away from this graph was that the highest average duration occurred all the way back in the 1960s. I then added a scatter plot of individual movie durations for each year, which helped clarify why: there are far fewer old films on the chart than new ones, so a few long titles can dominate an early year's average. This helped me answer my investigation question a bit. There has not been a notable increase in the length of movies. Instead, I think there are many more movies, which widens the extremes while the average stays in line with the historical view of the data. I do think the increase in points over the years may be attributed to the digital age we are in: old movies are only digitized if they are deemed good enough, so much of that older media never reaches Netflix.
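
One way to check the "few old films" explanation is to put the number of titles next to each yearly average. This is a rough sketch on a made-up data frame (`toy_movies` is hypothetical). In the report, `USMovies` would be used in its place. The median is included as an assumption on my part, since it is less sensitive to a single very long film than the mean.

```r
library(dplyr)

# Hypothetical toy data, for illustration only; stands in for USMovies
toy_movies <- data.frame(
  release_year = c(1965, 2019, 2019, 2019, 2020, 2020),
  duration     = c(180, 95, 100, NA, 90, 110)
)

# Count titles per year alongside the mean and median duration,
# so years backed by only one or two films are easy to spot
year_summary <- toy_movies %>%
  group_by(release_year) %>%
  summarise(
    n_titles        = n(),
    avg_duration    = mean(duration, na.rm = TRUE),
    median_duration = median(duration, na.rm = TRUE)
  )

print(year_summary)
```

A year with a high average but a very small `n_titles` is a sign that the peak reflects sparse data rather than a real trend.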

Plot 2

Investigation

Another question I had about this data is how the TV/movie ratings (PG, R, etc.) are distributed across genres. To answer this question I decided to use a 100% stacked bar chart.

USMovies <- USMovies %>%
  mutate(rating_grouped = case_when(
    rating %in% c("PG", "PG-13", "TV-PG", "TV-14") ~ "PG",
    rating %in% c("TV-Y7-FV", "TV-Y", "TV-Y7") ~ "Kids",
    rating %in% c("G", "TV-G") ~ "General",
    rating %in% c("R", "NC-17", "TV-MA") ~ "Adult",
    rating %in% c("NR", "UR") ~ "Not Rated",
    TRUE ~ rating
  ))

Rated <- USMovies %>% filter(rating_grouped != "Not Rated")

top_genres <- Rated %>%
  count(Genre, sort = TRUE) %>%
  slice_head(n = 10)

Rated10 <- Rated %>%
  filter(Genre %in% top_genres$Genre)

ggplot(Rated10, aes(x = Genre, fill = rating_grouped)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "Proportion of TV Ratings by Genre",
    x = "Genre",
    y = "Percentage of Movies",
    fill = "TV Rating"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Discussion

Before I get to my takeaways, I should note that I limited the chart to the 10 most common genres. This revealed a genre called “Movies”, which I found is used for movies tied to TV shows, as it included movies from the Batman cartoons, behind-the-scenes features for TV shows, and more. My real takeaways from this graph are that there are not many General Audience movies, and that the majority of the split is between Adult and PG. I found it interesting that the only genre on the list with a higher PG proportion than Adult proportion is Classic Movies. This could reflect how the world's view of harsh language in media has changed, meaning that the modern era is moving away from the PG label and embracing harshness.
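
The proportions that the stacked bars display can also be computed directly, which makes it easier to confirm which genres have more PG than Adult titles. The sketch below uses a made-up data frame (`toy_rated` is hypothetical and its rows are not from the dataset). In the report, `Rated10` would be used instead.

```r
library(dplyr)

# Hypothetical toy data, for illustration only; stands in for Rated10
toy_rated <- data.frame(
  Genre          = c("Classic Movies", "Classic Movies", "Classic Movies",
                     "Dramas", "Dramas"),
  rating_grouped = c("PG", "PG", "Adult", "Adult", "Adult")
)

# Count titles per genre and rating group, then convert the counts
# to within-genre proportions (the same quantity position = "fill" draws)
rating_share <- toy_rated %>%
  count(Genre, rating_grouped) %>%
  group_by(Genre) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

print(rating_share)
```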

Plotly

Investigation

The next question I posed was whether the average duration of a film changes by genre. To answer it I am using a bar chart that plots the average duration on the y axis and the genre on the x axis. I used plotly for this so I could add interactive hovers for each bar. I also decided to color the bars by genre so that it is easier to see the difference between each one.

genre_avg <- USMovies %>%
  group_by(Genre) %>%
  summarise(
    avg_duration = mean(duration, na.rm = TRUE),
    total_titles = n()
  ) %>%
  arrange(desc(avg_duration))

plot_ly(
  data = genre_avg,
  x = ~reorder(Genre, avg_duration),
  y = ~avg_duration,
  color = ~Genre,
  type = "bar",
  hovertext = ~paste(
    "Genre:", Genre,
    "<br>Avg Duration:", round(avg_duration, 1), "min",
    "<br>Total Titles:", total_titles
  ),
  hoverinfo = "text"
) %>%
  layout(
    title = "Average Movie Duration by Genre",
    xaxis = list(title = "Genre"),
    yaxis = list(title = "Average Duration (Minutes)"),
    showlegend = FALSE
  )

Discussion

This graph was interesting to me because it does show that there is a difference in length between genres. Notably, kids' movies were quite low on the list although they have one of the highest counts of movies, which on hover we can see to be 330 titles. I also noticed that movies based in reality or music are shorter on average, with stand-up comedy, documentaries, and musicals all being lower. The fact that most genres fall around the same average suggests to me that over time we have perfected movie length, meaning that we have found the maximum length directors can create while still getting a good return on the investment.
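
Because the genre averages rest on very different numbers of titles, it can help to look at the spread within each genre and to set aside genres with too few films. The following is a sketch only: `toy_genre` is a made-up data frame and `min_titles` is a hypothetical threshold I chose for illustration. In the report, `USMovies` would be used instead.

```r
library(dplyr)

# Hypothetical toy data, for illustration only; stands in for USMovies
toy_genre <- data.frame(
  Genre    = c("Children & Family Movies", "Children & Family Movies",
               "Dramas", "Dramas", "Dramas", "Musicals"),
  duration = c(80, 90, 110, 120, 130, 95)
)

min_titles <- 2  # hypothetical cutoff, not a value used in the report

# Mean and standard deviation of duration per genre,
# keeping only genres with at least min_titles films
genre_spread <- toy_genre %>%
  group_by(Genre) %>%
  summarise(
    total_titles = n(),
    avg_duration = mean(duration, na.rm = TRUE),
    sd_duration  = sd(duration, na.rm = TRUE)
  ) %>%
  filter(total_titles >= min_titles)

print(genre_spread)
```

A small standard deviation alongside similar averages would support the idea that films cluster around a common length.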

Overall Discussion of the Dataset

In the past I have used more rating-focused datasets for movies. This dataset lacked that kind of information, which limited the graphs I could create and meant I had to be more creative with my graph types. Nonetheless, I am satisfied with the information and insights I was able to get out of the data.