Module 1 Deliverable

Introduction

For Module 1, I looked at a data set of titles added to Netflix between 2008 and 2021. The data set includes each title's type (movie or TV show), title, director, cast, country of production, date added to Netflix, release year, rating, duration, genre (the listed_in column), and a description. I thought it would be interesting to compare ratings, genres, and countries of production across movies and TV shows, and to look at how these changed over time.

Description of Project

Before looking at the summary statistics, I added month and year columns based on the date each title was added to Netflix, because I knew that information would come in handy when building my visualizations. I then looked at the summary statistics of the numerical columns in the data set. The only numerical columns were the release year and the year and month columns that I added. Examining the other columns would require more manipulation of the data, so I decided to wait until I needed those columns for a specific visualization.
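As a minimal sketch of the date step described above, the snippet below parses a few strings in the same "Month day, year" format as date_added and pulls out the year and month. The sample values are hypothetical stand-ins, not rows from the real file.

```r
library(lubridate)

# Hypothetical sample values in the same format as the date_added column
date_added <- c("September 25, 2021", "January 1, 2016", NA)

# mdy() parses month-day-year text into Date values; a missing input stays NA
parsed <- mdy(date_added)

# Extract the numeric year and month from each parsed date
added_year  <- year(parsed)
added_month <- month(parsed)
```

Rows with a missing date simply carry NA through to the new columns, which is why the summary later reports missing values for both year and month.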

Data Visualization

The first visualization was a simple one examining when titles were added to Netflix. I decided that a bar chart would be the best way to show this, with different colored bars for movies and TV shows. I first made a new data frame from the large netflixdf I had created at the beginning and filtered out any missing values in the year column. I then grouped by both year and type and created a simple bar chart of the counts. I changed the scale on the y-axis so that the chart could be read more easily. I found that far more movies than TV shows were added to Netflix overall, and that 2019 was the year with the most titles added. The data is clearly left-skewed, as additions to Netflix really didn’t pick up until around 2016.
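The claim that 2019 had the most titles overall can be checked with a short sketch. The counts below are copied from the printed new_netflix table for 2018 through 2021; the names counts and yearly_totals are only illustrative.

```r
library(dplyr)

# Counts copied from the printed new_netflix table (2018 to 2021 only)
counts <- data.frame(
  type = rep(c("Movie", "TV Show"), each = 4),
  year = rep(2018:2021, times = 2),
  n    = c(1237, 1424, 1284, 993, 412, 592, 595, 505)
)

# Add movies and TV shows together per year, then sort so the busiest year is first
yearly_totals <- counts %>%
  group_by(year) %>%
  summarise(total = sum(n), .groups = "drop") %>%
  arrange(desc(total))
```

The first row of yearly_totals is 2019, with 1,424 movies plus 592 TV shows for 2,016 titles.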

The second visualization was a line plot with a separate line for each rating a title could have. The x-axis shows the year added and the y-axis shows the number of titles in each rating category. I wanted to see which ratings were the most common, and which ones Netflix could perhaps stand to release more of. The most common ratings by far were TV-MA and TV-14, and this graph was consistent with the previous one in showing that 2019 was the year in which Netflix added the most movies and shows. I had to filter out some data for this visualization, as a few entries in the rating column were “66 min”, “74 min”, and “84 min”, which are not ratings but appear to be run times. I removed those when building the data frame so that only true rating categories would be shown on the chart.
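An alternative to hard-coding the three stray values is to detect anything in the rating column that looks like a run time. The sketch below is one possible approach on hypothetical toy rows; it assumes the run times ended up in rating because they are missing from duration, which the data set suggests but does not confirm.

```r
library(dplyr)
library(stringr)

# Hypothetical toy rows: two of them have a run time sitting in the rating column
toy <- data.frame(
  rating   = c("TV-MA", "74 min", "PG-13", "66 min"),
  duration = c("2 Seasons", NA, "90 min", NA),
  stringsAsFactors = FALSE
)

cleaned <- toy %>%
  mutate(
    # Flag ratings that are really run times, such as "74 min"
    misplaced = str_detect(rating, "^\\d+ min$"),
    # Assumption: a flagged value belongs in duration when duration is empty
    duration = if_else(misplaced & is.na(duration), rating, duration),
    # Blank out the flagged ratings so they drop out of any rating counts
    rating = if_else(misplaced, NA_character_, rating)
  ) %>%
  select(-misplaced)
```

This keeps the run times instead of discarding them, at the cost of relying on the assumption above.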

The third visualization was a trellis chart made up of multiple pie charts. I wanted to examine genres by country, and figured that a pie chart for each country would be an effective way to do so. I first created the visualization without any filtering of genre or country, and there were far too many genres and countries for it to be effective or get any message across. I then decided to show only the top 10 countries, defined as the ones that released the most titles over the time span of this data set, so I filtered for those countries. ‘NA’ was in that top 10, so I removed it and then selected the true top 10. I made the visualization again, but there were still too many genres involved. I decided to look at only the top 15 genres, so I filtered the data down to those and made the visualization again. Now it was much easier to read the charts and analyze the data effectively. Finally, I labeled only the slices with more than 2% of the data in them so that everything would be easier to see, and added a caption at the bottom of the charts to explain that.
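The filtering steps above can be sketched on a small example. The toy data frame below is hypothetical and uses a top 2 instead of a top 10, but it follows the same pattern: split the comma-separated fields, drop missing countries, keep the most frequent ones, and compute each genre's share within a country.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with comma-separated country and genre fields
toy <- data.frame(
  country   = c("United States, India", "India", "United States", NA, "Japan"),
  listed_in = c("Dramas, Comedies", "Dramas", "Comedies, Dramas", "Dramas", "Anime Features"),
  stringsAsFactors = FALSE
)

# Top 2 countries by number of titles, dropping missing countries first
top_countries <- toy %>%
  separate_rows(country, sep = ",\\s*") %>%
  filter(!is.na(country)) %>%
  count(country, sort = TRUE) %>%
  slice_head(n = 2) %>%
  pull(country)

# Genre counts and within-country percentages for those countries only
genre_share <- toy %>%
  separate_rows(country, sep = ",\\s*") %>%
  filter(country %in% top_countries) %>%
  separate_rows(listed_in, sep = ",\\s*") %>%
  count(country, listed_in) %>%
  group_by(country) %>%
  mutate(percent = round(100 * n / sum(n), 1)) %>%
  ungroup()
```

Note that a title listed under two countries is counted once for each, so the per-country totals add up to more than the number of titles.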

For the fourth visualization, I wanted to keep looking at genres, but by year instead of by country. I also wanted to separate movies from TV shows, as each has its own set of genres and I didn’t want to display a single chart with a lot of gaps in it. I thought a heat map would be a good way to show this information, as it would emphasize the most popular genres with a darker color. I first made the necessary data frame by counting how many titles were added in each genre each year for both movies and TV shows. I then made the heat map, using facet wrap to separate by type (movie vs. TV show). The only thing I had to adjust on the heat map was the size of the labels on the boxes, as I wanted the full numbers to be visible on all of them.

The last visualization analyzed the final variable of interest to me, the duration of each title. Again, I thought it would be best to separate by type (movie vs. TV show), as movies are listed in the data set with run times in minutes and TV shows are listed with their number of seasons. I also added a trend line because I knew there would be a lot of points on the charts and the pattern would be hard to see otherwise. Creating the scatter plot was very simple, and again I used facet wrap to separate by type. Movies and TV shows followed the same pattern in their trend lines: starting out shorter, getting longer before 2000, and then settling back to shorter lengths in more recent years. Of course, there was more data in the years between 2016 and 2021, and there were a couple of outliers on both charts as well.
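Because the duration column mixes two units, it helps to see how the numeric part is pulled out. The sketch below uses hypothetical sample strings in the two formats found in the column; duration_unit is an illustrative extra, not part of the original analysis.

```r
library(stringr)

# Hypothetical sample values in the two formats found in the duration column
duration <- c("90 min", "2 Seasons", "1 Season", NA)

# Pull out the first run of digits and convert it to a number; NA stays NA
duration_num <- as.numeric(str_extract(duration, "\\d+"))

# Record the unit next to the number so minutes and seasons are never mixed
duration_unit <- ifelse(str_detect(duration, "min"), "minutes", "seasons")
```

Faceting by type with free y scales serves the same purpose in the plot, since each panel then holds only one unit.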

#introducing needed libraries
library(ggplot2)
library(lubridate)
library(dplyr)
library(scales)
library(ggthemes)
library(RColorBrewer)
library(data.table)
library(tidyr)
library(viridis)
library(stringr)

#setting working directory and reading in the file
setwd("/Users/rubysullivan/Desktop")
filename <- "/Users/rubysullivan/Desktop/netflix_titles.csv"
netflixdf <- fread(filename, na.strings = c(NA, ""))

#seeing basic information about the file
head(netflixdf)

##    show_id    type                 title        director
##     <char>  <char>                <char>          <char>
## 1:      s1   Movie  Dick Johnson Is Dead Kirsten Johnson
## 2:      s2 TV Show         Blood & Water            <NA>
## 3:      s3 TV Show             Ganglands Julien Leclercq
## 4:      s4 TV Show Jailbirds New Orleans            <NA>
## 5:      s5 TV Show          Kota Factory            <NA>
## 6:      s6 TV Show         Midnight Mass   Mike Flanagan
##                                                                                                                                                                                                                                                                                                               cast
##                                                                                                                                                                                                                                                                                                             <char>
## 1:                                                                                                                                                                                                                                                                                                            <NA>
## 2: Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Molaba, Dillon Windvogel, Natasha Thahane, Arno Greeff, Xolile Tshabalala, Getmore Sithole, Cindy Mahlangu, Ryle De Morny, Greteli Fincham, Sello Maake Ka-Ncube, Odwa Gwanya, Mekaila Mathys, Sandi Schultz, Duane Williams, Shamilla Miller, Patrick Mofokeng
## 3:                                                                                                                                                             Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabiha Akkari, Sofia Lesaffre, Salim Kechiouche, Noureddine Farihi, Geert Van Rampelberg, Bakary Diombera
## 4:                                                                                                                                                                                                                                                                                                            <NA>
## 5:                                                                                                                                                                                                        Mayur More, Jitendra Kumar, Ranjan Raj, Alam Khan, Ahsaas Channa, Revathi Pillai, Urvi Singh, Arun Kumar
## 6:                                                                        Kate Siegel, Zach Gilford, Hamish Linklater, Henry Thomas, Kristin Lehman, Samantha Sloyan, Igby Rigney, Rahul Kohli, Annarah Cymone, Annabeth Gish, Alex Essoe, Rahul Abburi, Matt Biedel, Michael Trucco, Crystal Balint, Louis Oliver
##          country         date_added release_year rating  duration
##           <char>             <char>        <int> <char>    <char>
## 1: United States September 25, 2021         2020  PG-13    90 min
## 2:  South Africa September 24, 2021         2021  TV-MA 2 Seasons
## 3:          <NA> September 24, 2021         2021  TV-MA  1 Season
## 4:          <NA> September 24, 2021         2021  TV-MA  1 Season
## 5:         India September 24, 2021         2021  TV-MA 2 Seasons
## 6:          <NA> September 24, 2021         2021  TV-MA  1 Season
##                                                        listed_in
##                                                           <char>
## 1:                                                 Documentaries
## 2:               International TV Shows, TV Dramas, TV Mysteries
## 3: Crime TV Shows, International TV Shows, TV Action & Adventure
## 4:                                        Docuseries, Reality TV
## 5:        International TV Shows, Romantic TV Shows, TV Comedies
## 6:                            TV Dramas, TV Horror, TV Mysteries
##                                                                                                                                                 description
##                                                                                                                                                      <char>
## 1: As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical ways to help them both face the inevitable.
## 2:      After crossing paths at a party, a Cape Town teen sets out to prove whether a private-school swimming star is her sister who was abducted at birth.
## 3:       To protect his family from a powerful drug lord, skilled thief Mehdi and his expert team of robbers are pulled into a violent and deadly turf war.
## 4:      Feuds, flirtations and toilet talk go down among the incarcerated women at the Orleans Justice Center in New Orleans on this gritty reality series.
## 5: In a city of coaching centers known to train India’s finest collegiate minds, an earnest but unexceptional student and his friends navigate campus life.
## 6: The arrival of a charismatic young priest brings glorious miracles, ominous mysteries and renewed religious fervor to a dying town desperate to believe.

colSums(is.na(netflixdf))

##      show_id         type        title     director         cast      country 
##            0            0            0         2634          825          831 
##   date_added release_year       rating     duration    listed_in  description 
##           10            0            4            3            0            0

#creating the large data frame and adding in new columns for year and month 
netflixdf <- netflixdf %>%
  select(type, title, director, cast, country, date_added, release_year, rating, duration, listed_in, description) %>%
  mutate(year = year(mdy(date_added)), month = month(mdy(date_added))) %>%
  distinct() %>%
  data.frame()

#basic summary statistics for each column 
summary(netflixdf)

##      type              title             director             cast          
##  Length:8807        Length:8807        Length:8807        Length:8807       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    country           date_added         release_year     rating         
##  Length:8807        Length:8807        Min.   :1925   Length:8807       
##  Class :character   Class :character   1st Qu.:2013   Class :character  
##  Mode  :character   Mode  :character   Median :2017   Mode  :character  
##                                        Mean   :2014                     
##                                        3rd Qu.:2019                     
##                                        Max.   :2021                     
##                                                                         
##    duration          listed_in         description             year     
##  Length:8807        Length:8807        Length:8807        Min.   :2008  
##  Class :character   Class :character   Class :character   1st Qu.:2018  
##  Mode  :character   Mode  :character   Mode  :character   Median :2019  
##                                                           Mean   :2019  
##                                                           3rd Qu.:2020  
##                                                           Max.   :2021  
##                                                           NA's   :10    
##      month       
##  Min.   : 1.000  
##  1st Qu.: 4.000  
##  Median : 7.000  
##  Mean   : 6.655  
##  3rd Qu.:10.000  
##  Max.   :12.000  
##  NA's   :10

Visualization 1: Movies vs. TV Shows and Dates Added

#creating the data frame for the plot, filtering out missing values and grouping the data 
new_netflix <- netflixdf %>%
  mutate(year = year(mdy(date_added))) %>%
  filter(!is.na(year)) %>%
  group_by(type, year) %>%
  summarise(n = n(), .groups = 'keep') %>%
  data.frame()
new_netflix

##       type year    n
## 1    Movie 2008    1
## 2    Movie 2009    2
## 3    Movie 2010    1
## 4    Movie 2011   13
## 5    Movie 2012    3
## 6    Movie 2013    6
## 7    Movie 2014   19
## 8    Movie 2015   56
## 9    Movie 2016  253
## 10   Movie 2017  839
## 11   Movie 2018 1237
## 12   Movie 2019 1424
## 13   Movie 2020 1284
## 14   Movie 2021  993
## 15 TV Show 2008    1
## 16 TV Show 2013    5
## 17 TV Show 2014    5
## 18 TV Show 2015   26
## 19 TV Show 2016  176
## 20 TV Show 2017  349
## 21 TV Show 2018  412
## 22 TV Show 2019  592
## 23 TV Show 2020  595
## 24 TV Show 2021  505

#creating the bar chart 
ggplot(new_netflix, aes(x = factor(year), y = n, fill = type)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Dates of Movies and TV Shows Added to Netflix", 
       x = "Year Added", y = "Total Added", fill = "Media Type") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_brewer(palette = "Paired", guide = guide_legend(reverse = TRUE)) +
  scale_y_continuous(labels = comma, breaks = seq(0, max(new_netflix$n), by = 200))

Visualization 2: Multiple Line Plot, by Year, Showing Ratings

#creating the data frame, filtering out messy data, selecting needed columns and grouping the data 
ratingdf <- netflixdf %>%
  select(rating, date_added) %>%
  filter(!rating %in% c("66 min", "74 min", "84 min")) %>%
  mutate(year = year(mdy(date_added))) %>%
  group_by(rating, year) %>%
  summarise(n = n(), .groups = 'keep') %>%
  data.frame()

#getting rid of missing values
ratingdf <- na.omit(ratingdf)
ratingdf

##       rating year   n
## 1          G 2014   1
## 2          G 2015   1
## 3          G 2016   2
## 4          G 2017   4
## 5          G 2018  12
## 6          G 2019   8
## 7          G 2020   9
## 8          G 2021   4
## 9      NC-17 2016   1
## 10     NC-17 2017   1
## 11     NC-17 2019   1
## 12        NR 2010   1
## 13        NR 2013   4
## 14        NR 2015   5
## 15        NR 2016  27
## 16        NR 2017  24
## 17        NR 2018  14
## 18        NR 2019   4
## 20        PG 2012   1
## 21        PG 2013   1
## 22        PG 2014   3
## 23        PG 2015   2
## 24        PG 2016   3
## 25        PG 2017  19
## 26        PG 2018  33
## 27        PG 2019  81
## 28        PG 2020  86
## 29        PG 2021  58
## 30     PG-13 2015   2
## 31     PG-13 2016   6
## 32     PG-13 2017  26
## 33     PG-13 2018  53
## 34     PG-13 2019 135
## 35     PG-13 2020 122
## 36     PG-13 2021 146
## 37         R 2012   1
## 38         R 2015   3
## 39         R 2016  14
## 40         R 2017  66
## 41         R 2018 129
## 42         R 2019 208
## 43         R 2020 188
## 44         R 2021 190
## 45     TV-14 2011   5
## 46     TV-14 2013   2
## 47     TV-14 2014   2
## 48     TV-14 2015  14
## 49     TV-14 2016  98
## 50     TV-14 2017 326
## 51     TV-14 2018 451
## 52     TV-14 2019 494
## 53     TV-14 2020 439
## 54     TV-14 2021 326
## 56      TV-G 2013   1
## 57      TV-G 2014   1
## 58      TV-G 2015   5
## 59      TV-G 2016   9
## 60      TV-G 2017  23
## 61      TV-G 2018  36
## 62      TV-G 2019  40
## 63      TV-G 2020  61
## 64      TV-G 2021  44
## 65     TV-MA 2008   2
## 66     TV-MA 2009   2
## 67     TV-MA 2011   3
## 68     TV-MA 2013   3
## 69     TV-MA 2014  12
## 70     TV-MA 2015  29
## 71     TV-MA 2016 162
## 72     TV-MA 2017 446
## 73     TV-MA 2018 650
## 74     TV-MA 2019 736
## 75     TV-MA 2020 671
## 76     TV-MA 2021 489
## 78     TV-PG 2011   5
## 79     TV-PG 2012   1
## 80     TV-PG 2014   4
## 81     TV-PG 2015   8
## 82     TV-PG 2016  50
## 83     TV-PG 2017 168
## 84     TV-PG 2018 184
## 85     TV-PG 2019 198
## 86     TV-PG 2020 146
## 87     TV-PG 2021  97
## 89      TV-Y 2014   1
## 90      TV-Y 2015   7
## 91      TV-Y 2016  10
## 92      TV-Y 2017  35
## 93      TV-Y 2018  40
## 94      TV-Y 2019  54
## 95      TV-Y 2020 102
## 96      TV-Y 2021  57
## 98     TV-Y7 2015   4
## 99     TV-Y7 2016  43
## 100    TV-Y7 2017  45
## 101    TV-Y7 2018  45
## 102    TV-Y7 2019  54
## 103    TV-Y7 2020  55
## 104    TV-Y7 2021  87
## 106 TV-Y7-FV 2015   2
## 107 TV-Y7-FV 2016   1
## 108 TV-Y7-FV 2017   1
## 109 TV-Y7-FV 2018   1
## 110 TV-Y7-FV 2019   1
## 111       UR 2017   1
## 112       UR 2019   2

#creating the chart 
ggplot(ratingdf, aes(x = year, y = n, group = rating)) + 
  geom_line(aes(color = rating), size = 2) +
  labs(title = "Ratings Over Time", x = "Year", y = "Total Titles per Rating", color = "Rating") + 
  theme_light() + 
  theme(plot.title = element_text(hjust = 0.5), plot.margin = margin(10, 10, 10, 10)) +
  geom_point(shape = 21, size = 3, color = "black", fill = "white") +
  scale_y_continuous(labels = comma, breaks = seq(0, max(ratingdf$n), by = 100))

Visualization 3: Multiple Pie Charts showing the top 15 genres in the top 10 countries

#selecting the top 10 countries by number of movies released
top_ten_countries <- netflixdf %>%
  select(country) %>%
  separate_rows(country, sep = ",\\s*") %>%
  group_by(country) %>%
  summarise(n = n(), .groups = 'keep') %>% 
  arrange(desc(n)) %>%
  data.frame()

#filtering out missing values
top_ten_countries <- na.omit(top_ten_countries)

#keeping only the top 10 countries 
topten <- head(top_ten_countries, n = 10)

#finding the top 15 genres overall 
top_genres <- netflixdf %>%
  separate_rows(listed_in, sep = ",\\s*") %>%
  count(listed_in, sort = TRUE) %>%  
  slice_head(n = 15) %>%  
  pull(listed_in)

#creating the data frame with all necessary information for the genre counts in the top 10 countries
genre_counts <- netflixdf %>%
  separate_rows(country, sep = ",\\s*") %>% 
  filter(country %in% topten$country) %>%
  separate_rows(listed_in, sep = ",\\s*") %>% 
  filter(listed_in %in% top_genres) %>%
  count(country, listed_in) %>%
  group_by(country) %>%
  mutate(percent = round(100 * n / sum(n),1)) %>%
  data.frame()

#creating the pie charts 
ggplot(data = genre_counts, aes(x = "", y = n, fill = listed_in)) +
  geom_bar(stat = "identity", position = "fill") +
  coord_polar(theta = "y", start = 0) +
  labs(fill = "Genres", x = NULL, y = NULL, 
       title = "Most Popular Genres in Countries that Have Produced the Most Titles", 
       caption = "Slices under 2% are not labeled") +
  theme_light() + 
  theme(plot.title = element_text(hjust = 0.5),
        axis.text = element_blank(), 
        axis.ticks = element_blank(),
        panel.grid = element_blank()) +
  facet_wrap(~country, ncol = 4, nrow = 3) +
  scale_fill_viridis(discrete = TRUE, option = "B") +
  geom_text(aes(x = 1.7, label = ifelse(percent > 2, paste0(round(percent, 1), "%"), "")),
            size = 2,
            position = position_fill(vjust = 0.5))

Visualization 4: Heatmap showing the most popular genres for movies vs. TV shows

#creating a data frame showing the most popular genres after separating genres by commas and getting rid of missing values 
top_genres <- netflixdf %>%
  separate_rows(listed_in, sep = ",\\s*") %>%
  count(type, year, listed_in, sort = TRUE) %>%  
  filter(!is.na(year) & !is.na(n)) %>% 
  data.frame()

#creating the heat maps, one for movies and one for tv shows 
ggplot(top_genres, aes(x = year, y = listed_in, fill = n)) +
  geom_tile(color = "black") +
  geom_text(aes(label = comma(n)), size = 3) +
  facet_wrap(~type, scales = "free_y") +
  labs(title = "Heatmap: Genre Popularity by Year (Movies vs. TV Shows)", 
       x = "Year", 
       y = "Genre",
       fill = "Count") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_continuous(low = "white", high = "hotpink")

Visualization 5: Scatter plots showing duration over time for movies vs. TV shows

#creating the data frame finding the numeric values for the run times 
vis5df <- netflixdf %>%
  mutate(duration = as.numeric(str_extract(duration, "\\d+"))) %>%
  filter(!is.na(duration) & !is.na(release_year))  

#creating the scatter plots, one for movies and one for tv shows with trendlines 
ggplot(vis5df, aes(x = release_year, y = duration, color = type)) +
  geom_jitter(alpha = 0.5, size = 2) +  
  geom_smooth(method = "loess", se = FALSE, color = "black", linetype = "dashed") +
  facet_wrap(~type, scales = "free_y") +
  labs(title = "Duration of Movies and TV Shows Over Time",
       x = "Release Year",
       y = "Duration (Minutes / Number of Seasons)",
       color = "Type") + 
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

Conclusion

Overall, it was a lot of fun to look at this Netflix data set. There was a lot of information to explore, and it was a little difficult at times to comb through the data, as a couple of columns held comma-separated values that needed to be split apart and some columns I needed to add myself using mutate. However, I think the charts I created paint a good picture of the data. They show how additions, ratings, genres, and durations changed over time, and how movies compare with TV shows.