For Module 1, I looked at a data set of Netflix titles that were added to the service between 2008 and 2021. The data set includes each title's type (movie or TV show), title, director, cast, country of production, date added to Netflix, release year, rating, duration, genre (the listed_in column), and a description. I thought it would be interesting to compare ratings, genres, and countries of production across movies and TV shows, and to look at how these changed over the years.
Before looking at the summary statistics, I added month and year columns based on the date each title was added to Netflix, because I knew that information would come in handy as I built my visualizations. I then looked at the summary statistics of the numerical columns, which were only the release year and the year and month columns I had added. Examining the other columns would require more manipulation of the data, so I decided to wait until I needed them for a specific visualization.
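As a minimal sketch of that date step, assuming a toy data frame (the `toy` and `parsed` names are made up for illustration) with dates in the same "Month day, year" layout as date_added:

```r
library(lubridate)

# Toy dates in the same "Month day, year" layout as the date_added column
toy <- data.frame(
  date_added = c("September 25, 2021", "January 1, 2019", NA),
  stringsAsFactors = FALSE
)

# Parse once, then pull out the pieces; a missing date stays NA
parsed <- mdy(toy$date_added)
toy$year  <- year(parsed)
toy$month <- month(parsed)
toy
```

Parsing the date once and reusing the result avoids calling mdy() separately for each new column.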
The first visualization I made was a simple one examining the dates that titles were added to Netflix. I decided that a bar chart would be the best way to do this, with different colored bars for Movie and TV Show. I first made a new data frame from the large netflixdf I had created in the beginning and filtered out any missing values in the year column. I then grouped by both year and type and created a simple bar chart to show the counts. I changed the scale on the y-axis so that the chart could be read more easily. I found that far more movies than TV shows were added to Netflix in general, and that 2019 was the year with the most media added overall. There is definitely a left skew to the data, as additions to Netflix really didn’t pick up until around 2016.
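The filter-then-count step described above can be sketched on toy data. This is only an illustration: the `toy` and `counts` names are hypothetical, and count() is used here as a compact equivalent of grouping and summarising.

```r
library(dplyr)

# Toy titles; the last row has no year and should be dropped
toy <- data.frame(
  type = c("Movie", "Movie", "TV Show", "Movie", "TV Show"),
  year = c(2019, 2019, 2019, 2020, NA),
  stringsAsFactors = FALSE
)

# Drop rows with a missing year, then count titles per type and year
counts <- toy %>%
  filter(!is.na(year)) %>%
  count(type, year, name = "n")
counts
```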
The second visualization I made was a line plot with a separate line for each rating that a title could have. The x-axis showed years and the y-axis showed the total number of titles in each rating category. I wanted to see which ratings were the most common, and which ones Netflix could perhaps stand to release more of. The most common ratings by far were TV-MA and TV-14, and this graph was consistent with the previous one in showing that 2019 was the year Netflix added the most movies and shows. I had to filter out some data for this visualization, as a few entries in the rating column were “66 min”, “74 min”, and “84 min”, which are not ratings but rather seem to be run times. I removed those when building my data frame so that only the true rating categories would be shown on the chart.
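Rather than listing the stray values by hand, they could also be detected with a pattern. The sketch below uses toy data and base R only. It assumes that a run time found in the rating column belongs in an empty duration cell, which the data set itself does not confirm.

```r
# Toy rows: the second one has a run time where the rating should be
toy <- data.frame(
  title    = c("A", "B", "C"),
  rating   = c("TV-MA", "74 min", "PG"),
  duration = c("2 Seasons", NA, "90 min"),
  stringsAsFactors = FALSE
)

# TRUE where the rating looks like a run time, e.g. "74 min"
misplaced <- grepl("^[0-9]+ min$", toy$rating)

# Assumption: a run time found in rating belongs in an empty duration cell
fix <- misplaced & is.na(toy$duration)
toy$duration[fix] <- toy$rating[fix]

# Either way, the stray value is not a rating, so blank it out
toy$rating[misplaced] <- NA
toy
```

A pattern also catches any other run times that might appear in the column, not just the three seen here.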
The third visualization I made was a trellis chart that contains multiple pie charts. I wanted to examine genres by country, and figured that a pie chart for each country would be an effective way to do so. I first created the visualization without filtering genre or country, and there were far too many genres and countries for it to be effective or get any message across. I then decided to show only the top 10 countries, meaning the ones that have released the most media over the time span of this data set, so I filtered for those countries. ‘NA’ was in that top 10, so I removed it and then selected the true top 10. I made the visualization again, but there were still too many genres involved. I decided to look at only the top 15 genres, so I filtered the data to keep just those and made the visualization again. Now it was much easier to read the charts and analyze the data effectively. Finally, I labeled only the slices holding more than 2% of the data so that everything would be easier to see, and added a caption at the bottom of the charts to explain that.
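The splitting and percentage step behind the pie charts can be sketched on toy data. The `toy` and `shares` names and the values are made up for illustration; the column names mirror the data set.

```r
library(dplyr)
library(tidyr)

# Toy titles with comma-separated countries and genres
toy <- data.frame(
  country   = c("United States, India", "India", "United States"),
  listed_in = c("Dramas, Comedies", "Dramas", "Comedies"),
  stringsAsFactors = FALSE
)

shares <- toy %>%
  separate_rows(country, sep = ",\\s*") %>%    # one row per country
  separate_rows(listed_in, sep = ",\\s*") %>%  # one row per genre
  count(country, listed_in) %>%
  group_by(country) %>%
  mutate(percent = round(100 * n / sum(n), 1)) %>%  # share within each country
  ungroup()
shares
```

Note that a title listed under several countries and several genres is counted once for every country and genre combination, so the percentages describe listings rather than unique titles.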
For the fourth visualization, I wanted to keep looking at genres, but by year instead of by country. I also wanted to separate movies from TV shows, as each type has its own set of genres and I didn’t want to display a single chart with a lot of gaps in it. I thought a heat map would be a good way to show this information, as it would emphasize the most popular genres with a darker color. I first made the necessary data frame by counting how many titles fell in each genre in each year, for both movies and TV shows. I then made the heat map, using facet wrap to separate by type (movie vs. TV show). The only thing I had to adjust on the heat map was the size of the labels on the boxes, as I wanted to be able to see the full numbers on all of them.
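One optional extension, not part of the analysis above, is to fill in year and genre combinations that have no titles with an explicit zero so that every cell of the heat map is drawn. A minimal sketch on toy data, assuming tidyr's complete() applied within each type (the `toy` and `grid` names are hypothetical):

```r
library(dplyr)
library(tidyr)

# Toy titles, already split to one genre per row
toy <- data.frame(
  type      = c("Movie", "Movie", "Movie", "TV Show"),
  year      = c(2019, 2020, 2020, 2020),
  listed_in = c("Dramas", "Dramas", "Comedies", "TV Dramas"),
  stringsAsFactors = FALSE
)

grid <- toy %>%
  count(type, year, listed_in) %>%
  group_by(type) %>%                                   # keep movie and TV genres apart
  complete(year, listed_in, fill = list(n = 0)) %>%    # add missing cells with n = 0
  ungroup()
grid
```

Grouping by type before completing keeps movie genres out of the TV show panel and vice versa.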
The last visualization I made analyzed the final variable that was of interest to me, the duration of each title. Again, I thought it would be best to separate by type (movie vs. TV show), as movies were listed in the data set with their run times in minutes and TV shows were listed with their number of seasons. I also wanted to add a trend line, because I knew there would be a lot of points on the charts and I thought it would be hard to see the pattern otherwise. Creating the scatter plot was very simple, and again I used facet wrap to separate by type. Movies and TV shows followed the same pattern with their trend lines, starting out shorter, getting longer before 2000, and then settling back to shorter lengths in more recent years. Of course, there was more data in the years between 2016 and 2021, and there were a couple of outliers on both charts as well.
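Pulling a number out of the duration text can be sketched as follows. The toy vector is made up, and the pattern simply takes the first run of digits in each string.

```r
library(stringr)

# Toy duration strings in the two layouts used by the data set
duration <- c("90 min", "2 Seasons", "1 Season", NA)

# Keep the first run of digits and convert it to a number; NA stays NA
duration_num <- as.numeric(str_extract(duration, "\\d+"))
duration_num
```

Because minutes and seasons end up in the same numeric column, the two types only make sense on separate panels with their own y-axis scales.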
#introducing needed libraries
library(ggplot2)
library(lubridate)
library(dplyr)
library(scales)
library(ggthemes)
library(RColorBrewer)
library(data.table)
library(tidyr)
library(viridis)
library(stringr)
#setting working directory and reading in the file
setwd("/Users/rubysullivan/Desktop")
filename <- "/Users/rubysullivan/Desktop/netflix_titles.csv"
netflixdf <- fread(filename, na.strings = c("NA", ""))
#seeing basic information about the file
head(netflixdf)
## show_id type title director
## <char> <char> <char> <char>
## 1: s1 Movie Dick Johnson Is Dead Kirsten Johnson
## 2: s2 TV Show Blood & Water <NA>
## 3: s3 TV Show Ganglands Julien Leclercq
## 4: s4 TV Show Jailbirds New Orleans <NA>
## 5: s5 TV Show Kota Factory <NA>
## 6: s6 TV Show Midnight Mass Mike Flanagan
## cast
## <char>
## 1: <NA>
## 2: Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Molaba, Dillon Windvogel, Natasha Thahane, Arno Greeff, Xolile Tshabalala, Getmore Sithole, Cindy Mahlangu, Ryle De Morny, Greteli Fincham, Sello Maake Ka-Ncube, Odwa Gwanya, Mekaila Mathys, Sandi Schultz, Duane Williams, Shamilla Miller, Patrick Mofokeng
## 3: Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabiha Akkari, Sofia Lesaffre, Salim Kechiouche, Noureddine Farihi, Geert Van Rampelberg, Bakary Diombera
## 4: <NA>
## 5: Mayur More, Jitendra Kumar, Ranjan Raj, Alam Khan, Ahsaas Channa, Revathi Pillai, Urvi Singh, Arun Kumar
## 6: Kate Siegel, Zach Gilford, Hamish Linklater, Henry Thomas, Kristin Lehman, Samantha Sloyan, Igby Rigney, Rahul Kohli, Annarah Cymone, Annabeth Gish, Alex Essoe, Rahul Abburi, Matt Biedel, Michael Trucco, Crystal Balint, Louis Oliver
## country date_added release_year rating duration
## <char> <char> <int> <char> <char>
## 1: United States September 25, 2021 2020 PG-13 90 min
## 2: South Africa September 24, 2021 2021 TV-MA 2 Seasons
## 3: <NA> September 24, 2021 2021 TV-MA 1 Season
## 4: <NA> September 24, 2021 2021 TV-MA 1 Season
## 5: India September 24, 2021 2021 TV-MA 2 Seasons
## 6: <NA> September 24, 2021 2021 TV-MA 1 Season
## listed_in
## <char>
## 1: Documentaries
## 2: International TV Shows, TV Dramas, TV Mysteries
## 3: Crime TV Shows, International TV Shows, TV Action & Adventure
## 4: Docuseries, Reality TV
## 5: International TV Shows, Romantic TV Shows, TV Comedies
## 6: TV Dramas, TV Horror, TV Mysteries
## description
## <char>
## 1: As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical ways to help them both face the inevitable.
## 2: After crossing paths at a party, a Cape Town teen sets out to prove whether a private-school swimming star is her sister who was abducted at birth.
## 3: To protect his family from a powerful drug lord, skilled thief Mehdi and his expert team of robbers are pulled into a violent and deadly turf war.
## 4: Feuds, flirtations and toilet talk go down among the incarcerated women at the Orleans Justice Center in New Orleans on this gritty reality series.
## 5: In a city of coaching centers known to train India’s finest collegiate minds, an earnest but unexceptional student and his friends navigate campus life.
## 6: The arrival of a charismatic young priest brings glorious miracles, ominous mysteries and renewed religious fervor to a dying town desperate to believe.
colSums(is.na(netflixdf))
## show_id type title director cast country
## 0 0 0 2634 825 831
## date_added release_year rating duration listed_in description
## 10 0 4 3 0 0
#creating the large data frame and adding in new columns for year and month
netflixdf <- netflixdf %>%
select(type, title, director, cast, country, date_added, release_year, rating, duration, listed_in, description) %>%
mutate(year = year(mdy(date_added)), month = month(mdy(date_added))) %>%
distinct() %>%
data.frame()
#basic summary statistics for each column
summary(netflixdf)
## type title director cast
## Length:8807 Length:8807 Length:8807 Length:8807
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## country date_added release_year rating
## Length:8807 Length:8807 Min. :1925 Length:8807
## Class :character Class :character 1st Qu.:2013 Class :character
## Mode :character Mode :character Median :2017 Mode :character
## Mean :2014
## 3rd Qu.:2019
## Max. :2021
##
## duration listed_in description year
## Length:8807 Length:8807 Length:8807 Min. :2008
## Class :character Class :character Class :character 1st Qu.:2018
## Mode :character Mode :character Mode :character Median :2019
## Mean :2019
## 3rd Qu.:2020
## Max. :2021
## NA's :10
## month
## Min. : 1.000
## 1st Qu.: 4.000
## Median : 7.000
## Mean : 6.655
## 3rd Qu.:10.000
## Max. :12.000
## NA's :10
#creating the data frame for the plot, filtering out missing values and grouping the data
new_netflix <- netflixdf %>%
mutate(year = year(mdy(date_added))) %>%
filter(!is.na(year)) %>%
group_by(type, year) %>%
summarise(n = n(), .groups = 'keep') %>%
data.frame()
new_netflix
## type year n
## 1 Movie 2008 1
## 2 Movie 2009 2
## 3 Movie 2010 1
## 4 Movie 2011 13
## 5 Movie 2012 3
## 6 Movie 2013 6
## 7 Movie 2014 19
## 8 Movie 2015 56
## 9 Movie 2016 253
## 10 Movie 2017 839
## 11 Movie 2018 1237
## 12 Movie 2019 1424
## 13 Movie 2020 1284
## 14 Movie 2021 993
## 15 TV Show 2008 1
## 16 TV Show 2013 5
## 17 TV Show 2014 5
## 18 TV Show 2015 26
## 19 TV Show 2016 176
## 20 TV Show 2017 349
## 21 TV Show 2018 412
## 22 TV Show 2019 592
## 23 TV Show 2020 595
## 24 TV Show 2021 505
#creating the bar chart
ggplot(new_netflix, aes(x = factor(year), y = n, fill = type)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "Dates of Movies and TV Shows Added to Netflix",
x = "Year Added", y = "Total Added", fill = "Media Type") +
theme_light() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Paired", guide = guide_legend(reverse = TRUE)) +
scale_y_continuous(labels = comma, breaks = seq(0, max(new_netflix$n), by = 200))
#creating the data frame, filtering out messy data, selecting needed columns and grouping the data
ratingdf <- netflixdf %>%
select(rating, date_added) %>%
filter(!rating %in% c("66 min", "74 min", "84 min")) %>%
mutate(year = year(mdy(date_added))) %>%
group_by(rating, year) %>%
summarise(n = n(), .groups = 'keep') %>%
data.frame()
#getting rid of missing values
ratingdf <- na.omit(ratingdf)
ratingdf
## rating year n
## 1 G 2014 1
## 2 G 2015 1
## 3 G 2016 2
## 4 G 2017 4
## 5 G 2018 12
## 6 G 2019 8
## 7 G 2020 9
## 8 G 2021 4
## 9 NC-17 2016 1
## 10 NC-17 2017 1
## 11 NC-17 2019 1
## 12 NR 2010 1
## 13 NR 2013 4
## 14 NR 2015 5
## 15 NR 2016 27
## 16 NR 2017 24
## 17 NR 2018 14
## 18 NR 2019 4
## 20 PG 2012 1
## 21 PG 2013 1
## 22 PG 2014 3
## 23 PG 2015 2
## 24 PG 2016 3
## 25 PG 2017 19
## 26 PG 2018 33
## 27 PG 2019 81
## 28 PG 2020 86
## 29 PG 2021 58
## 30 PG-13 2015 2
## 31 PG-13 2016 6
## 32 PG-13 2017 26
## 33 PG-13 2018 53
## 34 PG-13 2019 135
## 35 PG-13 2020 122
## 36 PG-13 2021 146
## 37 R 2012 1
## 38 R 2015 3
## 39 R 2016 14
## 40 R 2017 66
## 41 R 2018 129
## 42 R 2019 208
## 43 R 2020 188
## 44 R 2021 190
## 45 TV-14 2011 5
## 46 TV-14 2013 2
## 47 TV-14 2014 2
## 48 TV-14 2015 14
## 49 TV-14 2016 98
## 50 TV-14 2017 326
## 51 TV-14 2018 451
## 52 TV-14 2019 494
## 53 TV-14 2020 439
## 54 TV-14 2021 326
## 56 TV-G 2013 1
## 57 TV-G 2014 1
## 58 TV-G 2015 5
## 59 TV-G 2016 9
## 60 TV-G 2017 23
## 61 TV-G 2018 36
## 62 TV-G 2019 40
## 63 TV-G 2020 61
## 64 TV-G 2021 44
## 65 TV-MA 2008 2
## 66 TV-MA 2009 2
## 67 TV-MA 2011 3
## 68 TV-MA 2013 3
## 69 TV-MA 2014 12
## 70 TV-MA 2015 29
## 71 TV-MA 2016 162
## 72 TV-MA 2017 446
## 73 TV-MA 2018 650
## 74 TV-MA 2019 736
## 75 TV-MA 2020 671
## 76 TV-MA 2021 489
## 78 TV-PG 2011 5
## 79 TV-PG 2012 1
## 80 TV-PG 2014 4
## 81 TV-PG 2015 8
## 82 TV-PG 2016 50
## 83 TV-PG 2017 168
## 84 TV-PG 2018 184
## 85 TV-PG 2019 198
## 86 TV-PG 2020 146
## 87 TV-PG 2021 97
## 89 TV-Y 2014 1
## 90 TV-Y 2015 7
## 91 TV-Y 2016 10
## 92 TV-Y 2017 35
## 93 TV-Y 2018 40
## 94 TV-Y 2019 54
## 95 TV-Y 2020 102
## 96 TV-Y 2021 57
## 98 TV-Y7 2015 4
## 99 TV-Y7 2016 43
## 100 TV-Y7 2017 45
## 101 TV-Y7 2018 45
## 102 TV-Y7 2019 54
## 103 TV-Y7 2020 55
## 104 TV-Y7 2021 87
## 106 TV-Y7-FV 2015 2
## 107 TV-Y7-FV 2016 1
## 108 TV-Y7-FV 2017 1
## 109 TV-Y7-FV 2018 1
## 110 TV-Y7-FV 2019 1
## 111 UR 2017 1
## 112 UR 2019 2
#creating the chart
ggplot(ratingdf, aes(x = year, y = n, group = rating)) +
geom_line(aes(color = rating), linewidth = 2) +
labs(title = "Ratings Over Time", x = "Year", y = "Total Titles per Rating", color = "Rating") +
theme_light() +
theme(plot.title = element_text(hjust = 0.5), plot.margin = margin(10, 10, 10, 10)) +
geom_point(shape = 21, size = 3, color = "black", fill = "white") +
scale_y_continuous(labels = comma, breaks = seq(0, max(ratingdf$n), by = 100))
#ranking countries by number of titles released (movies and TV shows)
top_ten_countries <- netflixdf %>%
select(country) %>%
separate_rows(country, sep = ",\\s*") %>%
group_by(country) %>%
summarise(n = n(), .groups = 'keep') %>%
arrange(desc(n)) %>%
data.frame()
#filtering out missing values
top_ten_countries <- na.omit(top_ten_countries)
#keeping the 10 countries with the most titles
topten <- head(top_ten_countries, n = 10)
#finding the top 15 genres overall
top_genres <- netflixdf %>%
separate_rows(listed_in, sep = ",\\s*") %>%
count(listed_in, sort = TRUE) %>%
slice_head(n = 15) %>%
pull(listed_in)
#creating the data frame with all necessary information for the genre counts in the top 10 countries
genre_counts <- netflixdf %>%
separate_rows(country, sep = ",\\s*") %>%
filter(country %in% topten$country) %>%
separate_rows(listed_in, sep = ",\\s*") %>%
filter(listed_in %in% top_genres) %>%
count(country, listed_in) %>%
group_by(country) %>%
mutate(percent = round(100 * n / sum(n),1)) %>%
data.frame()
#creating the pie charts
ggplot(data = genre_counts, aes(x = "", y = n, fill = listed_in)) +
geom_bar(stat = "identity", position = "fill") +
coord_polar(theta = "y", start = 0) +
labs(fill = "Genres", x = NULL, y = NULL,
title = "Most Popular Genres in Countries that Have Produced the Most Movies",
caption = "Slices under 2% are not labeled") +
theme_light() +
theme(plot.title = element_text(hjust = 0.5),
axis.text = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank()) +
facet_wrap(~country, ncol = 4, nrow = 3) +
scale_fill_viridis(discrete = TRUE, option = "B") +
geom_text(aes(x = 1.7, label = ifelse(percent > 2, paste0(round(percent, 1), "%"), "")),
size = 2,
position = position_fill(vjust = 0.5))
#creating a data frame showing the most popular genres after separating genres by commas and getting rid of missing values
top_genres <- netflixdf %>%
separate_rows(listed_in, sep = ",\\s*") %>%
count(type, year, listed_in, sort = TRUE) %>%
filter(!is.na(year) & !is.na(n)) %>%
data.frame()
#creating the heat maps, one for movies and one for tv shows
ggplot(top_genres, aes(x = year, y = listed_in, fill = n)) +
geom_tile(color = "black") +
geom_text(aes(label = comma(n)), size = 3) +
facet_wrap(~type, scales = "free_y") +
labs(title = "Heatmap: Genre Popularity by Year (Movies vs. TV Shows)",
x = "Year",
y = "Genre",
fill = "Count") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_continuous(low = "white", high = "hotpink")
#creating the data frame finding the numeric values for the run times
vis5df <- netflixdf %>%
mutate(duration = as.numeric(str_extract(duration, "\\d+"))) %>%
filter(!is.na(duration) & !is.na(release_year))
#creating the scatter plots, one for movies and one for tv shows with trendlines
ggplot(vis5df, aes(x = release_year, y = duration, color = type)) +
geom_jitter(alpha = 0.5, size = 2) +
geom_smooth(method = "loess", se = FALSE, color = "black", linetype = "dashed") +
facet_wrap(~type, scales = "free_y") +
labs(title = "Duration of Movies and TV Shows Over Time",
x = "Release Year",
y = "Duration (Minutes / Number of Seasons)",
color = "Type") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
Overall, it was a lot of fun to look at this Netflix data set. There was a lot of information to explore, and it was a little difficult at times to comb through the data, as a couple of columns held comma-separated values that needed to be split apart and some columns I needed to add myself using mutate. However, I think the charts I created paint a good picture of the data. They bring out some interesting trends over time and show how movies and TV shows compare.