The aim of this project is to use RStudio for data visualization. The data set is the Netflix titles data from Tidy Tuesday, which can be found here. In this project I explore the movies and TV shows available on Netflix.
# loading libraries
library(data.table)
library(tidytuesdayR)
library(tidyverse)
library(scales)
library(kableExtra)
library(gganimate)
library(tidytext)
library(viridis)
library(plotly)
library(lubridate)
library(ggplot2)
library(hrbrthemes)
library(dplyr)
library(modelsummary)
library(readr)
library(igraph)
library(ggraph)
library(snakecase)
library(tidylo)
library(widyr)
library(tidygraph)
library(tm)
library(factoextra)
library(rpubs)
The data set is called Netflix titles, and its source is Tidy Tuesday.
# Loading the data
netflix <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-04-20/netflix_titles.csv')
# Custom ggplot2 theme shared by all plots
theme_g <- function() {
  font <- "Arial" # assign font family up front
  theme_bw() %+replace% # replace elements we want to change
    theme(
      legend.position = "top", # legend position
      panel.background = element_blank(), # remove panel background
      panel.border = element_blank(), # remove panel border
      # grid elements
      panel.grid.major = element_blank(), # strip major gridlines
      panel.grid.minor = element_blank(), # strip minor gridlines
      axis.ticks = element_blank(), # strip axis ticks
      # text elements
      plot.title = element_text( # title
        family = font, # set font family
        size = 12L, # set font size
        face = "bold", # bold typeface
        hjust = 0.5, # center align
        vjust = 2 # raise slightly
      ),
      plot.caption = element_text( # caption
        family = font, # font family
        size = 9, # font size
        hjust = 1 # right align
      ),
      axis.title = element_text( # axis titles
        family = font, # font family
        size = 10 # font size
      ),
      axis.text = element_text( # axis text
        family = font, # font family
        size = 8 # font size
      )
    )
}
count_rows <- nrow(netflix)
count_cols <- ncol(netflix)
This data set contains 7,787 observations and 12 columns. The main columns are content type, director, cast, country, date added to Netflix, release year, rating, listed in (genres) and description.
To clean the data, I parsed the date_added column, which is stored in month, day, year format, into a date and created another column for the year in which the content was added. The duration column holds the length of the movie or TV show, measured in minutes and seasons respectively. Because movies and TV shows are measured in different units, I split the column into duration and duration unit and converted duration to numeric. I also added a content details column, which combines the content type with the genres the title is listed in, and a decade column for movies.
# Transforming to data table
netflix <- as.data.table(netflix)
# Formatting the date added column
netflix <- netflix[, date_added := mdy(date_added)]
# Adding a new column for the year the content is added
netflix <- netflix[, year_added := year(date_added)]
# Splitting duration into its value and its unit (e.g. "90 min" becomes "90" and "min")
netflix <- netflix[, c("duration", "duration_unit") := tstrsplit(duration, " ", fixed = TRUE)]
# changing the duration column to numeric
netflix$duration <- as.numeric(netflix$duration)
# Creating a new column for the content details
netflix$content_details <- paste0(netflix$type, ", ", netflix$listed_in)
# Adding the release decade for movies (e.g. 1994 becomes 1990)
netflix[type == "Movie", decade := 10 * (release_year %/% 10)]
netflix$decade <- as.numeric(netflix$decade)
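The cleaning steps above can be tried on a few made-up rows. This is a minimal sketch, not the project's actual data: the `toy` table and its values are hypothetical, and `tstrsplit()` from data.table is used here as one way to split the duration string.

```r
library(data.table)

# Hypothetical rows that mimic the raw `duration` strings
toy <- data.table(
  type         = c("Movie", "TV Show", "Movie"),
  duration     = c("90 min", "2 Seasons", "125 min"),
  release_year = c(1994L, 2019L, 2007L)
)

# Split "90 min" into a value and a unit, then make the value numeric
toy[, c("duration", "duration_unit") := tstrsplit(duration, " ", fixed = TRUE)]
toy[, duration := as.numeric(duration)]

# Integer division by 10 drops the last digit of the year
toy[, decade := 10 * (release_year %/% 10)]

print(toy)
```

Keeping the unit in its own column is what makes the numeric `duration` safe to use: minutes and seasons should only be summarized within one content type.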
In the section below, I create another data table that contains the country, the year and the number of titles added to Netflix. I also check for rows with missing values and drop them. The Netflix data set contains several rating systems, so I create a vector of the MPA (Motion Picture Association) film ratings.
# creating new data table of countries to be used for further mapping
countries <- netflix[, .(count = .N), by = .(country, year_added)]
countries <- drop_na(countries)
# extracting each country and the year its content was added to Netflix
countries <- countries %>%
  ungroup() %>%
  separate_rows(country, sep = ",") %>%
  mutate(country = str_trim(country)) %>%
  group_by(year_added, country) %>%
  # sum the pre-aggregated counts so that count is the number of titles
  summarize(count = sum(count), .groups = "drop") %>%
  filter(country != "NA", country != "") %>%
  arrange(year_added, desc(count))
# Flagging rows with missing values so they can be dropped
row.has.na <- apply(countries, 1, function(x){any(is.na(x))})
sum(row.has.na)
## [1] 0
countries <- countries[!row.has.na,]
# adding the country codes
countries <- as.data.table(countries)
countries <- countries[, iso2 := countrycode::countryname(country,destination = "iso2c")]
countries <- countries[, iso3 := countrycode::countryname(country,destination = "iso3c")]
# renaming the column year_added
colnames(countries)[colnames(countries) == "year_added"] <- "year"
# Defining the MPA (Motion Picture Association) film ratings
MPA_ratings <- c("G", "PG", "PG-13", "R", "NC-17")
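The country split above can be illustrated on a small made-up table. This is only a sketch: the rows in `toy` are hypothetical, and it assumes the quantity of interest is the number of titles per country and year. Because the table is aggregated before the country lists are split, counting rows with `n()` and summing the `count` column give different numbers, so the sketch computes both.

```r
library(dplyr)
library(tidyr)

# Hypothetical pre-aggregated rows: one row per country combination and year
toy <- tibble(
  country    = c("United States, India", "India", "United States"),
  year_added = c(2019, 2019, 2019),
  count      = c(3, 5, 10)
)

per_country <- toy %>%
  separate_rows(country, sep = ",") %>%   # one row per listed country
  mutate(country = trimws(country)) %>%   # remove the space left after the comma
  group_by(year_added, country) %>%
  summarize(
    rows   = n(),        # number of country combinations that mention it
    titles = sum(count), # number of titles across those combinations
    .groups = "drop"
  ) %>%
  arrange(year_added, desc(titles))

print(per_country)
```

In this toy table both countries appear in two combinations, yet they account for 13 and 8 titles, which is why the choice between `n()` and `sum(count)` matters for the later map and bar chart.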
I created another table that splits the description column into individual words, stored in a new column called word, and removes common stop words.
# tokenizing the descriptions and removing stop words
words <- netflix %>%
unnest_tokens(word, description) %>%
anti_join(stop_words, by = "word")
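To make the tokenizing step concrete, here is a minimal sketch on two invented descriptions. The `toy` table and its text are hypothetical; `unnest_tokens()` and the `stop_words` data set come from tidytext, as in the code above.

```r
library(dplyr)
library(tidytext)

# Hypothetical descriptions standing in for the Netflix `description` column
toy <- tibble(
  title       = c("Title A", "Title B"),
  description = c("A mother and her daughter move to a new town.",
                  "The detective and the thief.")
)

toy_words <- toy %>%
  unnest_tokens(word, description) %>%  # one lowercase word per row
  anti_join(stop_words, by = "word")    # drop common words such as "and"

print(toy_words)
```

Each remaining row pairs a title with one content word, which is the shape the later word correlation step expects.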
The Netflix Content Type summary indicates that 69.05% of the observations are movies, compared to 30.95% TV shows. Moreover, the Rating Categories table shows a wide range of movie and TV show ratings in the Netflix data set.
# data summary
# Data summary for type
datasummary((`Type` = type) ~ N + Percent(), data = netflix, title = "Netflix Content Type")
| Type | N | Percent |
|---|---|---|
| Movie | 5377 | 69.05 |
| TV Show | 2410 | 30.95 |
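The percentages in the table follow directly from the counts. A quick check in base R, using only the two counts reported above:

```r
# Counts taken from the content type table
type_counts <- c(Movie = 5377, `TV Show` = 2410)

# Share of each type as a percentage of all titles
type_share <- round(100 * type_counts / sum(type_counts), 2)
print(type_share)
```

The two counts add up to 7,787 titles, and the shares round to 69.05 and 30.95.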
# Data summary for rating
datasummary((`Rating` = rating )~ N + Percent(), data = netflix, title = "Rating Categories")
| Rating | N | Percent |
|---|---|---|
| G | 39 | 0.50 |
| NC-17 | 3 | 0.04 |
| NR | 84 | 1.08 |
| PG | 247 | 3.17 |
| PG-13 | 386 | 4.96 |
| R | 665 | 8.54 |
| TV-14 | 1931 | 24.80 |
| TV-G | 194 | 2.49 |
| TV-MA | 2863 | 36.77 |
| TV-PG | 806 | 10.35 |
| TV-Y | 280 | 3.60 |
| TV-Y7 | 271 | 3.48 |
| TV-Y7-FV | 6 | 0.08 |
| UR | 5 | 0.06 |
In the Duration summary table, movie duration is measured in minutes and TV show duration in number of seasons. The average movie lasts about 99 minutes, and the average TV show runs for about 1.8 seasons (the median is one season).
# data summary for Duration and type
datasummary((`Type` = type)*(`Duration` = duration) ~ Min + Max + Mean + Median + N , data = netflix, title = "Duration Summary")
| Type | | Min | Max | Mean | Median | N |
|---|---|---|---|---|---|---|
| Movie | Duration | 3.00 | 312.00 | 99.31 | 98.00 | 5377 |
| TV Show | Duration | 1.00 | 16.00 | 1.78 | 1.00 | 2410 |
# Data summary for type and year
datasummary((`Type` = type)*(`Year added` = year_added) ~ Min + Max + N , data = netflix, title = "Year Added Summary")
| Type | | Min | Max | N |
|---|---|---|---|---|
| Movie | Year added | 2008.00 | 2021.00 | 5377 |
| TV Show | Year added | 2008.00 | 2021.00 | 2400 |
The distribution of content by year added shows that Netflix has more movies than TV shows.
# content distribution
ggplot(data=netflix, aes(x=year_added, fill = type)) +
geom_bar() +
xlab("Year") +
ylab("Count")+
labs(title = "Netflix Content Distribution", fill = "Type") +
scale_fill_viridis(discrete = T, alpha = 0.9) +
theme(legend.position = "top",
panel.border = element_blank(), axis.text=element_text(size=8),
plot.title = element_text(size = 12L, face = "bold", hjust = 0.5),
panel.background = element_rect(fill = NA) ) +
theme_g()
Movie duration is measured in minutes. The distribution below is roughly bell-shaped, and the typical movie lasts around 100 minutes, which is close to the mean duration in the summary table.
# movies duration distribution
ggplot(data=netflix[netflix$type == "Movie", ], aes(x=duration)) +
geom_bar(fill = "#440154", alpha = 0.8) +
xlab("Duration (in minutes) ") +
ylab("Count")+
labs(title = "Netflix Movies Duration Distribution") +
theme(legend.position = "top",
panel.border = element_blank(), axis.text=element_text(size=8),
plot.title = element_text(size = 12L, face = "bold", hjust = 0.5),
panel.background = element_rect(fill = NA) ) +
theme_g()
The distribution of TV show seasons shows that many shows have only one season. The distribution is right-skewed, with a long right tail.
# tv shows seasons distribution
ggplot(data=netflix[netflix$type == "TV Show", ], aes(x=duration)) +
geom_bar(fill = "#440154", alpha = 0.8) +
xlab("Number of Seasons") +
ylab("Count")+
labs(title = "Netflix TV Shows Seasons Distribution") +
theme(legend.position = "top",
panel.border = element_blank(), axis.text=element_text(size=8),
plot.title = element_text(size = 12L, face = "bold", hjust = 0.5), panel.background = element_rect(fill = NA) ) +
theme_g()
The graph below shows that the United States is the largest contributor to Netflix content.
# top 10 content-providing countries
ggplot(head(countries[, .(total = sum(count)), by = country][order(-total)], 10), aes(x = reorder(country, -total), y = total)) +
geom_col(fill = "#440154", alpha = 0.8) +
xlab("Country") +
ylab("Number of contents") +
labs(title = "Number of Netflix Content per Country") +
theme(legend.position = "top",
panel.border = element_blank(), axis.text=element_text(size=8),
plot.title = element_text(size = 12L, face = "bold", hjust = 0.5), panel.background = element_rect(fill = NA) ) + theme_g()
In addition to the columns summarized above, the description column contains a large amount of information about each movie or TV show. It is interesting to see how words in the descriptions relate to each other, such as mother, daughter, son and others.
# creating words relationship
words %>%
distinct(type, title, word) %>%
add_count(word, name = "word_total") %>%
filter(word_total >= 40) %>%
pairwise_cor(word, title, sort = TRUE) %>%
filter(correlation >= .1) %>%
igraph::graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(alpha = correlation)) +
geom_node_point() +
geom_node_text(aes(label = name),
repel = TRUE) +
theme(legend.position = "none", panel.background = element_rect(fill = NA), axis.ticks = element_blank(), axis.text = element_blank()) + theme_g() +
xlab("") +
ylab("")
# the code source is provided in the reference
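The idea behind the word network can be sketched in base R. When each word is recorded only as present or absent in a title, the correlation between two words is the ordinary Pearson correlation of their 0/1 indicator vectors. The matrix below is hypothetical, and the sketch assumes that this presence-based correlation is what `pairwise_cor()` computes when no value column is supplied.

```r
# Hypothetical word-by-title presence table (1 = the word appears in that title)
presence <- rbind(
  mother   = c(1, 1, 0, 0, 1, 0),
  daughter = c(1, 1, 0, 0, 0, 0),
  murder   = c(0, 0, 1, 1, 0, 0)
)
colnames(presence) <- paste0("title_", 1:6)

# Correlate words by treating each title as one observation
word_cor <- cor(t(presence))
print(round(word_cor, 2))
```

Words that tend to appear in the same titles, such as "mother" and "daughter" here, get a positive correlation and would pass the `correlation >= .1` filter, while words that never share a title get a negative one and are dropped from the graph.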
In the section below, I focus on the main questions of this project.
The curve below shows that very few of the movies and TV shows on Netflix were released before 2000, and that most were released after 2000. The lines drop at the end because the latest content in the data was added in January 2021, so few 2021 releases are included.
# How has the production of movies and TV shows changed across years?
netflix <- as.data.table(netflix)
ggplot(netflix[, .(count = .N), by = .(type, release_year)], aes(x=release_year, y=count, group=type, color=type)) +
geom_line() +
geom_point() +
scale_color_viridis(discrete = TRUE) +
ggtitle("Netflix Movies and TV shows per year") +
ylab("Number of Movies / TV shows") +
xlab("Release Year") +
labs(color = "Type", group = "Type") +
theme(legend.position = "top", panel.background = element_rect(fill = NA),
panel.border = element_blank(), axis.text=element_text(size=8),
plot.title = element_text(size = 12L, face = "bold", hjust = 0.5) ) +
transition_reveal(release_year) +
theme_g()
The plot below shows that movies released before the 1950s were shorter. From the 1960s to 2000 movies were longer, and since then movie duration has been gradually declining on average.
# How has the average movie duration changed across the decades?
ggplot(netflix[type == "Movie"], aes(decade, duration, group = decade, fill = decade)) +
xlab("Decade") +
ylab("Duration") +
labs(fill = "Decade", group = "Decade", title = "Movies Duration across Decades") +
geom_boxplot() +
theme(legend.position = "top", panel.background = element_rect(fill = NA),
panel.border = element_blank(), axis.text=element_text(size=8),
plot.title = element_text(size = 12L, face = "bold", hjust = 0.5) ) +
scale_fill_viridis() +
transition_reveal(decade) + theme_g()
The movies are rated as follows:
G, general audiences: all ages admitted,
PG, parental guidance suggested: some material may not be suitable for children,
PG-13, parents strongly cautioned: some material may be inappropriate for children under 13,
R, restricted: viewers under 17 require an accompanying parent or adult guardian,
NC-17, no one 17 and under admitted.
The violin plot below shows that movies rated G or PG-13 tend to be shorter; thus, we can say that movies which children can watch are shorter on average.
# 3- How does movie duration change based on the target audience?
ggplot(netflix[type == "Movie" & rating %in% MPA_ratings][, rating2 := factor(rating, levels = rev(MPA_ratings))], aes(x = rating2, y = duration, fill = rating2)) +
geom_violin() +
geom_hline(yintercept = 99, linetype = 2) +
coord_flip() +
scale_fill_viridis_d() +
theme(legend.position = "top", panel.background = element_rect(fill = NA),
panel.border = element_blank(), axis.text=element_text(size=8),
plot.title = element_text(size = 12L, face = "bold", hjust = 0.5) ) +
labs(x = "Film rating", y = "Film duration (minutes)",
title = "Movies Duration Based on Target Audience", fill = "Rating") + theme_g()
The figure below shows the clusters of genres. The clusters show that movies are commonly listed as family and children's content. Moreover, the largest genre cluster contains thrillers, crime, horror, reality and many other genres.
library(tm)
# building corpus
corpus <- Corpus(VectorSource(netflix$listed_in))
# create term document matrix
tdm <- TermDocumentMatrix(corpus,
                          control = list(wordLengths = c(1, Inf)))
# convert to matrix
m <- as.matrix(tdm)
# Hierarchical word clustering using dendrogram
distance <- dist(scale(m))
hc <- hclust(distance, method = "ward.D")
# Circular
Circ = fviz_dend(hc, cex = 0.7, lwd = 0.5, k = 5,
rect = TRUE,
k_colors = c("#440154", "#3b528b", "#21918c", "#5ec962", "#fde725"),
rect_border = c("#440154", "#3b528b", "#21918c", "#5ec962", "#fde725"),
rect_fill = TRUE,
type = "circular",
ylab = "")
Circ
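The clustering step can be followed on a tiny term matrix in base R. This is a minimal sketch: the matrix `toy_m` is invented, and `cutree()` is used here only to read off the groups that the dendrogram colors represent.

```r
# Hypothetical term-by-title counts: rows are genre terms, columns are titles
toy_m <- rbind(
  children = c(1, 1, 0, 0, 0, 0),
  family   = c(1, 1, 0, 0, 0, 0),
  crime    = c(0, 0, 1, 1, 0, 1),
  thriller = c(0, 0, 1, 1, 0, 0),
  horror   = c(0, 0, 0, 1, 1, 0)
)

# Same steps as above: scale, compute distances, cluster with Ward's method
toy_distance <- dist(scale(toy_m))
toy_hc <- hclust(toy_distance, method = "ward.D")

# Cut the tree into two groups of terms
toy_groups <- cutree(toy_hc, k = 2)
print(toy_groups)
```

Terms that occur in exactly the same titles, like "children" and "family" in this toy matrix, have distance zero and are merged first, so they always end up in the same cluster.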
The map below shows that, based on the available data, the United States was the only content contributor to Netflix in 2008. As time passed, other countries across the globe joined Netflix.
# changing to numeric
countries$count <- as.numeric(countries$count)
# creating hover
countries <- countries %>% mutate(hover = paste0(country, "\n", year))
# removing nulls
countries <- drop_na(countries)
# creating type of map
g <- list(
projection = list(
type = 'natural earth'
),
showland = TRUE,
landcolor = toRGB("#F9F9F9")
)
# plotting the map
plot_geo(countries,
locationmode = "ISO-3",
frame = ~year) %>%
add_trace(locations = ~iso3,
z = ~count,
zmin = 1,
zmax = max(countries$count),
color = ~count,
text = ~hover,
hoverinfo = "text") %>%
layout(geo = g,
title = "Countries added to Netflix\n2008-2021")
Based on the visualization and text analysis of the Netflix data, movies make up the larger part of Netflix content. Furthermore, the data summaries and visualizations indicate that content has been added to Netflix since 2008, when the United States was the only contributing country. Moreover, Netflix movies and TV shows are rated, and these ratings are standardized for different audiences. The analysis of duration and rating categories shows that movies which children can watch are shorter on average.