Introduction

The aim of this project is to use RStudio for data visualization. The data is the Netflix titles data set from Tidy Tuesday, which can be found here. In this project, I explore the Movies and TV shows on Netflix.

Setting up the Environment

Loading Libraries

Loading Data

The data set is called Netflix titles, and its source is Tidy Tuesday.

# Loading the data

netflix <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-04-20/netflix_titles.csv')

Defining the theme

This data set contains 7,787 observations and 12 columns. The main columns are content type, director, cast, country, date added to Netflix, release year, rating, listed in, and description.

Data Cleaning

To clean the data, I converted the date_added column to a month, day and year date format and created another column for the year in which the content was added. The duration column holds the length of each Movie or TV show, measured in minutes and seasons respectively. Because movies and TV shows are measured in different units, I split the column into duration and duration unit and then converted the duration column to numeric. I also added a new column for content details, which is a combination of the content type and the genres the title is listed in.
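The cleaning code itself is not shown in this report, so the following is only a minimal base R sketch of the steps described above. The toy rows, the "Month day, year" date format, and the column names duration_unit and content_details are assumptions made for illustration.

```r
# Toy rows that mimic the raw columns (values are illustrative, not taken from the data)
toy <- data.frame(
  type       = c("Movie", "TV Show"),
  date_added = c("August 14, 2020", " January 1, 2019"),
  duration   = c("93 min", "2 Seasons"),
  listed_in  = c("Dramas, Thrillers", "Kids' TV"),
  stringsAsFactors = FALSE
)

# Parse "Month day, year" text into a Date (month names assume an English locale)
toy$date_added <- as.Date(trimws(toy$date_added), format = "%B %d, %Y")

# Year in which the content was added
toy$year_added <- as.integer(format(toy$date_added, "%Y"))

# Split "93 min" / "2 Seasons" into a unit and a numeric value
toy$duration_unit <- sub("^[0-9]+\\s*", "", toy$duration)
toy$duration      <- as.numeric(sub("\\s.*$", "", toy$duration))

# Combine content type and genres into one descriptive column
toy$content_details <- paste(toy$type, toy$listed_in, sep = ": ")

toy
```

Keeping the unit in its own column makes it possible to treat duration as a number while still separating minutes from seasons in later summaries.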

Creating New Table

In the section below, I created another data table that contains the country, the year, and the number of titles added to Netflix. I also created a function to remove null values. Because the Netflix data set contains several rating systems, I created a vector of the MPA (Motion Picture Association) ratings.
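The code for this table is not shown here either. A rough base R sketch of how such a country, year and count table could be built is given below; the helper name remove_nulls, the toy rows, and the assumption that multi-country entries are separated by commas are all illustrative.

```r
# Toy rows (illustrative values only)
toy <- data.frame(
  country    = c("United States, Canada", "India", NA, "United States"),
  year_added = c(2019L, 2019L, 2020L, 2020L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only the rows that have no missing values
remove_nulls <- function(df) df[stats::complete.cases(df), , drop = FALSE]
toy <- remove_nulls(toy)

# One row per (country, year) pair: split multi-country entries on commas
parts <- strsplit(toy$country, ",\\s*")
long <- data.frame(
  country = unlist(parts),
  year    = rep(toy$year_added, lengths(parts)),
  stringsAsFactors = FALSE
)

# Count the number of titles per country and year
countries_toy <- aggregate(count ~ country + year,
                           data = transform(long, count = 1L),
                           FUN = sum)

# Vector of MPA (Motion Picture Association) ratings
mpa_ratings <- c("G", "PG", "PG-13", "R", "NC-17")

countries_toy
```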

I created another table that splits the description column into individual words, with each word stored in a new column called word.

# splitting descriptions into words and removing stop words

words <- netflix %>%
  unnest_tokens(word, description) %>%
  anti_join(stop_words, by = "word")
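To illustrate what this step does without the tidytext package, here is a small base R sketch. The toy descriptions and the very short stop-word list are invented for illustration; the report itself relies on unnest_tokens() and the stop_words table.

```r
# Toy descriptions (illustrative only)
toy <- data.frame(
  title = c("A", "B"),
  description = c("A mother and her daughter move to the city.",
                  "The detective hunts a killer."),
  stringsAsFactors = FALSE
)

# Tiny illustrative stop-word list; the real list is much longer
stop_list <- c("a", "and", "her", "to", "the")

# Lower-case each description and split it on anything that is not a letter or apostrophe
tokens <- strsplit(tolower(toy$description), "[^a-z']+")

# One row per (title, word) pair
words_toy <- data.frame(
  title = rep(toy$title, lengths(tokens)),
  word  = unlist(tokens),
  stringsAsFactors = FALSE
)

# Drop empty tokens and stop words
words_toy <- words_toy[nzchar(words_toy$word) & !(words_toy$word %in% stop_list), ]

words_toy
```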

Descriptive Summary

The Netflix Content Type summary indicates that 69.05% of the observations are Movies and 30.95% are TV Shows. The Rating Categories table shows the wide range of movie and TV show ratings in the Netflix data set.

# data summary
# Data summary for type
datasummary((`Type` = type) ~ N + Percent(), data = netflix, title = "Netflix Content Type")

Netflix Content Type
Type	N	Percent
Movie	5377	69.05
TV Show	2410	30.95
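
The percentages in this table follow directly from the counts. As a quick check in base R, using only the two counts reported above:

```r
# Counts taken from the content type summary table
type_counts <- c(Movie = 5377, "TV Show" = 2410)

# Total number of titles
total_titles <- sum(type_counts)

# Share of each type as a percentage of all titles, rounded to two decimals
type_share <- round(100 * type_counts / total_titles, 2)

type_share
```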

# Data summary for rating 
datasummary((`Rating` = rating )~ N + Percent(), data = netflix, title = "Rating Categories")

Rating Categories
Rating	N	Percent
G	39	0.50
NC-17	3	0.04
NR	84	1.08
PG	247	3.17
PG-13	386	4.96
R	665	8.54
TV-14	1931	24.80
TV-G	194	2.49
TV-MA	2863	36.77
TV-PG	806	10.35
TV-Y	280	3.60
TV-Y7	271	3.48
TV-Y7-FV	6	0.08
UR	5	0.06

In the Duration Summary table, movie duration is measured in minutes and TV show duration in number of seasons. The average duration of a movie is about 99 minutes, and the average length of a TV show is around two seasons (mean 1.78, median 1).

# data summary for Duration and type
datasummary((`Type` = type)*(`Duration` = duration) ~ Min + Max + Mean + Median + N , data = netflix, title = "Duration Summary")

Duration Summary
Type		Min	Max	Mean	Median	N
Movie	Duration	3.00	312.00	99.31	98.00	5377
TV Show	Duration	1.00	16.00	1.78	1.00	2410

# Data summary for type and year
datasummary((`Type` = type)*(`Year added` = year_added) ~ Min + Max + N , data = netflix, title = "Year Added Summary")

Year Added Summary
Type		Min	Max	N
Movie	Year added	2008.00	2021.00	5377
TV Show	Year added	2008.00	2021.00	2400

Distributions

Netflix content distribution shows the existence of more movies than TV shows.

types

## Warning: Removed 10 rows containing non-finite values (stat_count).

Movie duration is measured in minutes. The distribution below is approximately normal, and the average duration of a movie appears to be around 100 minutes, which is close to the mean duration in the summary table.

# movies duration distribution
ggplot(data=netflix[netflix$type == "Movie", ], aes(x=duration)) +
  geom_bar(fill = "#440154", alpha = 0.8) +
  xlab("Duration (in minutes) ") +
  ylab("Count")+
  labs(title = "Netflix Movies Duration Distribution") +
    theme(legend.position = "top", 
        panel.border = element_blank(), axis.text=element_text(size=8), 
        plot.title = element_text(size = 12L, face = "bold", hjust = 0.5), 
        panel.background = element_rect(fill = NA) ) +
  theme_g()

The TV show distribution shows that many shows have only one season. The distribution is right-skewed, with a long right tail.

# tv shows seasons distribution 

ggplot(data=netflix[netflix$type == "TV Show", ], aes(x=duration)) +
  geom_bar(fill = "#440154", alpha = 0.8) +
  xlab("Number of Seasons") +
  ylab("Count")+
  labs(title = "Netflix TV Shows Seasons Distribution") +
    theme(legend.position = "top", 
        panel.border = element_blank(), axis.text=element_text(size=8), 
        plot.title = element_text(size = 12L, face = "bold", hjust = 0.5), panel.background = element_rect(fill = NA) ) +
  theme_g()

The graph below shows that the United States is the largest contributor of Netflix content.

# top 10 countries by total number of contents
ggplot(countries[ , .(country, count)][, .(total = sum(count)), by = country ][order(-total)][1:10], aes(x = country, y = total)) + 
  geom_col(fill = "#440154", alpha = 0.8) + 
  xlab("Country") +
  ylab("Number of contents") +
  labs(title = "Number of Netflix Content per Country") + 
  theme(legend.position = "top", 
        panel.border = element_blank(), axis.text=element_text(size=8), 
        plot.title = element_text(size = 12L, face = "bold", hjust = 0.5), panel.background = element_rect(fill = NA) ) +
  transition_states(country) + theme_g()

In addition to the data summaries, the description column provides a large amount of detail about each movie or TV show. It is interesting to see how the words in the descriptions are related to each other, such as mother, daughter, son and others.

# creating words relationship
words %>%
  distinct(type, title, word) %>%
  add_count(word, name = "word_total") %>%
  filter(word_total >= 40) %>%
  pairwise_cor(word, title, sort = TRUE) %>%
  filter(correlation >= .1) %>%
  igraph::graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(alpha = correlation)) +
  geom_node_point() +
  geom_node_text(aes(label = name),
                 repel = TRUE) +
  theme(legend.position = "none", panel.background = element_rect(fill = NA), axis.ticks = element_blank(), axis.text = element_blank()) + theme_g() + 
  xlab("") +
  ylab("")
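
The pairwise_cor() call above measures how often two words appear in the descriptions of the same titles. A rough base R sketch of the same idea follows, under the assumption that the correlation is a Pearson correlation on a binary title-by-word indicator matrix (the phi coefficient). The toy titles and words are invented for illustration.

```r
# Toy (title, word) pairs, one row per distinct word in a title's description
toy_words <- data.frame(
  title = c("A", "A", "B", "B", "C", "D"),
  word  = c("mother", "daughter", "mother", "daughter", "detective", "detective"),
  stringsAsFactors = FALSE
)

# Rows are titles, columns are words; 1 means the word occurs in that title
indicator <- (unclass(table(toy_words$title, toy_words$word)) > 0) * 1

# Pearson correlation between the binary word columns
word_cor <- cor(indicator)

word_cor
```

In this toy example, mother and daughter always occur together and are perfectly correlated, while detective never shares a title with them and is negatively correlated with both.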

# the code source is provided in the reference

word_genre_log_odds <- words %>%
  distinct(type, title, word, genre = listed_in) %>%
  add_count(word, name = "word_total") %>%
  filter(word_total >= 25) %>%
  separate_rows(genre, sep = ", ") %>%
  filter(fct_lump(genre, 9) != "Other") %>%
  count(genre, word) %>%
  bind_log_odds(genre, word, n)

word_genre_log_odds %>%
  group_by(genre) %>%
  top_n(10, log_odds_weighted) %>%
  ungroup() %>%
  mutate(word = reorder_within(word, log_odds_weighted, genre)) %>%
  ggplot(aes(log_odds_weighted, word, fill = genre)) +
  geom_col() +
  scale_fill_viridis(discrete = T, direction = -1) +
  facet_wrap(~ genre, scales = "free_y") +
  scale_y_reordered() +
  theme(legend.position = "none") +
  labs(x = "Log-odds of word's specificity to genre",
       y = "") +
  theme_g()

Main Tasks

In the section below, I focus on the main questions of this project.

1- How has the production of Movies and TV shows changed across the years?

The curve below clearly indicates that very few of the movies and TV shows on Netflix were released before 2000; as the animated line graph shows, most were released after 2000. The decline at the end of the lines arises because the latest date on which content was added in this data is January 2021, so few titles from 2021 are included.

# How has the production of Movies and TV shows changed across the years?
netflix <- as.data.table(netflix)

ggplot(netflix[, .(count = .N), by = .(type, release_year)], aes(x=release_year, y=count, group=type, color=type)) +
  geom_line() +
  geom_point() +
  scale_color_viridis(discrete = TRUE) +
  ggtitle("Netflix Movies and TV shows per year") +
  ylab("Number of Movies / TV shows") +
  xlab("Release Year") +
  labs(color = "Type", group = "Type") +
  theme(legend.position = "top", panel.background = element_rect(fill = NA),
        panel.border = element_blank(), axis.text=element_text(size=8), 
        plot.title = element_text(size = 12L, face = "bold", hjust = 0.5) ) +
  transition_reveal(release_year) +
  theme_g()

2- How has the average movie duration changed across the decades?

The plot below shows that movies released before the 1950s were shorter. Movies released from the 1960s to 2000 were longer, and since then the average movie duration has been gradually declining.

# How has the average movie duration changed across the decades?

ggplot(netflix[type == "Movie"], aes(decade, duration, group = decade, fill = decade)) +
  xlab("Decade") +
  ylab("Duration") +
  labs(fill = "Decade", group = "Decade", title = "Movies Duration across Decades") +
  geom_boxplot() +
  theme(legend.position = "top", panel.background = element_rect(fill = NA),
        panel.border = element_blank(), axis.text=element_text(size=8), 
        plot.title = element_text(size = 12L, face = "bold", hjust = 0.5) ) +
  scale_fill_viridis() +
  transition_reveal(decade) + theme_g()
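
The decade column used in this plot is created in a part of the code that is not shown. One plausible way to derive it, given here only as an assumption, is to round each release year down to the start of its decade. The toy values are illustrative.

```r
# Toy movies (illustrative values only)
toy_movies <- data.frame(
  release_year = c(1948, 1965, 1999, 2004, 2019),
  duration     = c(80, 120, 110, 100, 95)
)

# Round each release year down to the start of its decade, e.g. 1999 becomes 1990
toy_movies$decade <- (toy_movies$release_year %/% 10) * 10

# Median movie duration per decade
decade_median <- aggregate(duration ~ decade, data = toy_movies, FUN = median)

decade_median
```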

3- How does movie duration change based on the target audience?

The movies are rated as follows:

G, general audiences: all ages admitted,
PG, parental guidance suggested: some material may not be suitable for children,
PG-13, parents strongly cautioned: some material may be inappropriate for children under 13,
R, restricted: viewers under 17 require an accompanying parent or adult guardian,
NC-17, no one 17 and under admitted

The violin plot below shows that movies rated G or PG-13 tend to be shorter, which suggests that movies children can watch are shorter on average.

# 3- How does movie duration change based on the target audience?
ggplot(netflix[type == "Movie", ][rating %in% c("G", "PG", "PG-13", "R", "NC-17"), 
                           rating2 := factor(rating, levels = rev(c("G", "PG", "PG-13", "R", "NC-17")))][!is.na(rating),][!is.na(rating2)], aes(x = rating2, y = duration, fill = rating2)) +
  geom_violin() +
  geom_hline(yintercept = 99, linetype = 2) +
  coord_flip() +
  scale_fill_viridis_d() +
  theme(legend.position = "top", panel.background = element_rect(fill = NA),
        panel.border = element_blank(), axis.text=element_text(size=8), 
        plot.title = element_text(size = 12L, face = "bold", hjust = 0.5) ) +
  labs(x = "Film rating", y = "Film duration (minutes)",
       title = "Movie Duration Based on Target Audience", fill = "Rating") + theme_g()

4- How are the genres clustered?

The figure below shows the clusters of genres. The clusters show that the family and children's movie genres group together. Moreover, the largest genre cluster contains thriller, crime, horror, reality and many other genres.

library(tm)
# building corpus
corpus <- Corpus(VectorSource(netflix$listed_in))

# create term document matrix
tdm <- TermDocumentMatrix(corpus, 
                          control = list(wordLengths = c(1, Inf)))
# convert to matrix
m <- as.matrix(tdm)

# Hierarchical word clustering using dendrogram
distance <- dist(scale(m))
hc <- hclust(distance, method = "ward.D")

# Circular
Circ = fviz_dend(hc, cex = 0.7, lwd = 0.5, k = 5,
                 rect = TRUE,
                 k_colors = c("#440154", "#3b528b", "#21918c", "#5ec962", "#fde725"),
                 rect_border = c("#440154", "#3b528b", "#21918c", "#5ec962", "#fde725"),
                 rect_fill = TRUE,
                 type = "circular",
                 ylab = "")
Circ
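
To make the clustering step easier to follow, here is a small base R sketch on an invented term matrix. It uses the same dist() and hclust() calls with Ward linkage as above, but it skips the scale() step and the circular plot, and the genre terms and values are assumptions chosen for illustration.

```r
# Rows are genre terms, columns are titles; 1 means the term appears in the title's genre list
m_toy <- rbind(
  children = c(1, 1, 0, 0, 0, 0),
  family   = c(1, 1, 1, 0, 0, 0),
  crime    = c(0, 0, 0, 1, 1, 0),
  thriller = c(0, 0, 0, 1, 1, 1)
)

# Euclidean distance between genre terms
distance_toy <- dist(m_toy)

# Hierarchical clustering with Ward linkage
hc_toy <- hclust(distance_toy, method = "ward.D")

# Cut the tree into two clusters and inspect the membership of each term
clusters <- cutree(hc_toy, k = 2)

clusters
```

Terms that tend to appear in the same titles end up close together and are merged early in the tree, which is why related genres fall into the same cluster.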

5- When was each country's content added to Netflix?

The map below shows that, based on the available data, the United States was the only content contributor to Netflix in 2008. As time passed, other countries across the globe joined Netflix.

# changing count to numeric
countries$count <- as.numeric(countries$count)

# creating hover 
countries <- countries %>% mutate(hover = paste0(country, "\n", year))

# removing nulls
countries <- drop_na(countries)
# creating type of map
g <- list(
  projection = list(
    type = 'natural earth'
  ),
  showland = TRUE,
  landcolor = toRGB("#F9F9F9")
)

# plotting the map
plot_geo(countries, 
         locationmode = "ISO-3",
         frame = ~year) %>% 
  add_trace(locations = ~iso3,
            z = ~count,
            zmin = 1,
            zmax = max(countries$count),
            color = ~count,
            text = ~hover,
            hoverinfo = "text") %>% 
  layout(geo = g,
         title = "Countries added to Netflix\n2008-2021")

Conclusion

Based on the visualization and text analysis of the Netflix data, movies make up the larger part of the Netflix content. Furthermore, the data summaries and visualizations indicate that content has been added to Netflix since 2008, and that the United States was the only contributing country at that time. Moreover, Netflix movies and TV shows are rated, and these ratings are standardized for different audiences. The duration and rating category analysis shows that movies which children can watch are, on average, shorter.

Final Project : Tidy-Tuesday - Netflix titles

Ghazal Ayobi

2/12/2022