Netflix is a streaming service that offers a wide variety of TV shows and movies. Created in America, the on-demand streaming platform is very popular as it is convenient and flexible. Users are able to view Netflix on multiple devices, allowing them to watch their favorite content whenever and wherever. According to Netflix’s revenue and usage statistics, at the end of 2020 there were about 204 million Netflix subscribers worldwide. Netflix is a global phenomenon and is only on the rise.
I watch Netflix practically every day. I think that they do a great job of providing entertaining and relevant content. I am always able to find something to watch, and most of the time I end up enjoying it too.
This is why I became curious as to how Netflix’s selection comes to be. With a huge variety of content in several different categories, I thought it would be interesting to do an analysis of Netflix’s content. There are tons of different factors in their selection, such as genre, country of production, rating, release year, and duration.
The data used for this project comes from Kaggle. This dataset consists of TV Shows and Movies available on Netflix as of 2019. The Kaggle data set was collected from Flixable, a third-party Netflix search engine, which I treated as a reliable source.
I had to make a few adjustments to the original data set in Excel. First, I didn’t see a need to include the cast or directors, so I deleted those columns. Second, some TV Shows and Movies were listed under multiple countries and genres. Using Excel, I deleted everything after the first comma so that each title had a single country and a single genre. Lastly, the column for the date added to Netflix included the month and day. I knew this would be hard to analyze in R, so I created a formula in Excel to keep only the year.
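The same clean-up could also be done directly in R. Below is a minimal sketch of that idea, not the steps I actually ran: the rows are made up, the column names `country`, `listed_in` and `date_added` are assumptions about the raw Kaggle file, and `first_entry()` and `year_added()` are hypothetical helpers.

```r
# Hypothetical rows mimicking the raw columns (values are made up).
raw <- data.frame(
  country    = c("United States, Canada", "India", NA),
  listed_in  = c("Dramas, Independent Movies", "Kids' TV", "Documentaries"),
  date_added = c("September 9, 2019", "January 1, 2018", NA),
  stringsAsFactors = FALSE
)

# Keep only the text before the first comma (the first listed country or genre).
first_entry <- function(x) trimws(sub(",.*$", "", x))

# Pull the four-digit year off the end of the date string.
year_added <- function(x) as.integer(sub("^.*([0-9]{4})$", "\\1", x))

cleaned <- data.frame(
  Country = first_entry(raw$country),
  Genre   = first_entry(raw$listed_in),
  Added   = year_added(raw$date_added),
  stringsAsFactors = FALSE
)
print(cleaned)
```

Missing values stay `NA` here instead of turning into a placeholder year, which makes them easier to spot later.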
When it comes to how Netflix chooses the content to stream on their platform, I did a lot of research and came across a very informative blog post. Of course, Netflix needs licenses from studios in order to broadcast their content on the streaming service, but the selection of their Movies and TV Shows is definitely not done at random.
Netflix is a very data-driven company. The decision to add each TV Show and Movie to their selection is backed up by a ton of data. Netflix uses analytics to determine which content will best serve their users. Since they have such a large number of subscribers, Netflix is able to gather a tremendous amount of data, and from this data Netflix is able to make better decisions.
Since Netflix is an American company, I was curious about the role its home country plays on the international streaming platform. Netflix’s media library is home to works from a lot of different countries, but does it hold mostly American produced content?
My analysis will look at the TV Shows and Movies on Netflix as of 2019. Although I won’t be able to determine why each TV Show and Movie was selected to be a part of Netflix’s platform, I will be able to view the number of TV Shows and Movies on Netflix, their categorized genres, ratings and release year. In addition to analyzing Netflix’s 2019 data as a whole, I’ll identify which TV Show and which Movie has been on Netflix the longest.
From there, I’ll be able to look into which countries have the most content on the streaming platform when it comes to number of TV Shows and Movies produced from their locations. I will also be able to analyze that country’s genre, ratings and release year to see if the specific country takes up the majority of Netflix’s content.
library(tidytext)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.6 ✓ dplyr 1.0.4
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(readxl)
library(rmarkdown)
netflix_titles <- read_excel("~/Desktop/netflix_titles.xlsx")
## New names:
## * `` -> ...2
## * `` -> ...4
## * `` -> ...5
## * `` -> ...6
## * `` -> ...7
## * ...
colnames(netflix_titles) <- c('Type','Title','Country', 'notincludeddate','ReleaseYear','Rating','Duration', 'Genre','Added')
The first topics I want to look at have to do with Netflix’s 2019 selection as a whole.
netflix_titles %>%
filter(Type%in%"TV Show") %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 2410
In 2019, there were 2,410 TV Shows to select from on Netflix.
netflix_titles %>%
filter(Type%in%"Movie") %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 5377
In 2019, there were 5,377 Movies to select from on Netflix.
From here, I can analyze the genres of the two types of content on Netflix.
netflix_titles %>%
filter(Type%in%"TV Show") -> NetflixTV
NetflixTV %>% filter(!Genre %in% "TV Shows") %>% ggplot(aes(Genre, fill = Genre)) + geom_bar() + coord_flip()
For TV Shows on Netflix, the most common genre was International TV Shows. This was interesting to see since Netflix is an American company.
netflix_titles %>%
filter(Type%in%"TV Show" & Genre%in% "International TV Shows") %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 690
A total of 690 International TV Shows were on Netflix in 2019.
netflix_titles %>%
filter(Type%in%"Movie") -> NetflixMOV
ggplot(NetflixMOV, aes(Genre, fill = Genre)) + geom_bar() + coord_flip()
For Movies on Netflix, the most common genre was Dramas.
netflix_titles %>%
filter(Type%in%"Movie" & Genre%in% "Dramas") %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 1384
A total of 1,384 Drama Movies were on Netflix in 2019.
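Reading the top genre off a bar chart works, but the counts can also be tabulated directly. This is a small sketch on made-up genre values (not the real data), using base R’s `table()` and `sort()` to find the most common genre and its share of all titles.

```r
# Hypothetical genre column (made-up values, not the real data).
genre <- c("Dramas", "Comedies", "Dramas", "Documentaries", "Dramas", "Comedies")

# Tally each genre and sort from most to least common.
genre_counts <- sort(table(genre), decreasing = TRUE)
print(genre_counts)

# Name of the most common genre and its share of all titles.
top_genre <- names(genre_counts)[1]
top_share <- round(100 * genre_counts[[1]] / length(genre), 1)
```

Checking the share is useful because the most common genre is not always a majority of the titles.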
After looking at the genres, it’s important to highlight the different ratings for the content.
ggplot(NetflixTV, aes(Rating, fill=Rating)) + geom_bar() + coord_flip()
ggplot(NetflixMOV, aes(Rating, fill = Rating)) + geom_bar() + coord_flip()
These plots show that TV-MA was the most popular rating for TV Shows and for Movies as well.
It’s interesting that the most common rating for both Movies and TV Shows is TV-MA. According to Spectrum, TV-MA means that the content is intended for adults and may be unsuitable for children under 17. It makes sense that so much of Netflix’s content is rated TV-MA, because a study by Statista showed that a lot of Netflix subscribers are above the age of 18.
Next, I wanted to look at how recent Netflix’s selection is. This can be done using the release year provided by the data set, since it tells us the year the TV Show or Movie was released to the public. Had the Movies on the streaming platform been released recently? Or did Netflix add older Movies?
netflix_titles %>%
group_by(ReleaseYear) %>%
count(sort = TRUE) %>%
ggplot(aes(ReleaseYear,n)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Looking at the content selection in this plot, Netflix’s selection comes from a lot of different years, and more recent release years have more titles. Now, I’ll look into the year that Netflix added each TV Show and Movie. This is out of pure curiosity to see which TV Show or Movie has been on Netflix the longest.
NetflixTV %>%
filter(!Added %in% 1900) %>%
filter(!Added %in% 2020) %>%
count(Added, sort = TRUE)
## # A tibble: 9 x 2
## Added n
## <dbl> <int>
## 1 2019 656
## 2 2018 430
## 3 2017 361
## 4 2016 185
## 5 2015 30
## 6 2021 29
## 7 2014 6
## 8 2013 5
## 9 2008 1
I filtered out the years 1900 and 2020 because neither makes sense in a look at Netflix in 2019: 2020 had not happened yet, and 1900 is most likely an artifact of my Excel year formula, since Excel returns 1900 as the year of a blank date cell. Both can be credited to messy data. (A further 29 rows dated 2021 remain in the output above and should be read with the same caution.) With those years removed, I was able to see the number of TV Shows added to Netflix in each year. 656 TV Shows were added in 2019, more than in any other year, which shows that most of Netflix’s TV Show selection was added recently. Note that Added is the year a title joined Netflix, not the year it was first released.
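The two filters can also be combined into one `%in%` test. Here is a minimal base R sketch on made-up years (an illustration, not the code I ran above):

```r
# Hypothetical "Added" years, including the unwanted values 1900 and 2020.
added <- c(2019, 1900, 2018, 2020, 2019, 2017, 1900)

# Drop both unwanted years with a single %in% test.
kept <- added[!added %in% c(1900, 2020)]

# Count titles per year, most frequent first.
counts <- sort(table(kept), decreasing = TRUE)
print(counts)
```

In the dplyr pipeline the equivalent would be a single `filter(!Added %in% c(1900, 2020))`.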
NetflixTV %>%
filter(!Added %in% 1900) %>%
filter(!Added %in% 2020) %>%
count(Added, Title,sort = TRUE)
## # A tibble: 1,703 x 3
## Added Title n
## <dbl> <chr> <int>
## 1 2008 Dinner for Five 1
## 2 2013 Breaking Bad 1
## 3 2013 Gossip Girl 1
## 4 2013 Jack Taylor 1
## 5 2013 Russell Peters vs. the World 1
## 6 2013 The 4400 1
## 7 2014 Goosebumps 1
## 8 2014 Lilyhammer 1
## 9 2014 Pee-wee's Playhouse 1
## 10 2014 The Borgias 1
## # … with 1,693 more rows
This output shows that the TV Show that had been on Netflix the longest in their 2019 selection was “Dinner for Five.” It was added in 2008.
NetflixMOV %>%
filter(!Added %in% 1900) %>%
filter(!Added %in% 2020) %>%
count(Added, sort = TRUE)
## # A tibble: 13 x 2
## Added n
## <dbl> <int>
## 1 2019 1497
## 2 2018 1255
## 3 2017 864
## 4 2016 258
## 5 2021 88
## 6 2015 58
## 7 2014 19
## 8 2011 13
## 9 2013 6
## 10 2012 3
## 11 2009 2
## 12 2008 1
## 13 2010 1
As with TV Shows, I filtered out 1900 and 2020. After doing this, I found that 1,497 Movies were added to Netflix in 2019, more than in any other year. This shows that most of Netflix’s Movie selection was also added recently.
NetflixMOV %>%
filter(!Added %in% 1900) %>%
filter(!Added %in% 2020) %>%
count(Added, Title,sort = TRUE)
## # A tibble: 4,065 x 3
## Added Title n
## <dbl> <chr> <int>
## 1 2008 To and From New York 1
## 2 2009 Just Another Love Story 1
## 3 2009 Splatter 1
## 4 2010 Mad Ron's Prevues from Hell 1
## 5 2011 A Stoning in Fulham County 1
## 6 2011 Adam: His Song Continues 1
## 7 2011 Even the Rain 1
## 8 2011 Hard Lessons 1
## 9 2011 In Defense of a Married Man 1
## 10 2011 Joseph: King of Dreams 1
## # … with 4,055 more rows
It was interesting to see that “To and From New York” was the Movie that had been on Netflix the longest in the 2019 selection, because it was also added in 2008.
Overall, looking at Netflix’s selection in terms of the year the content was released to the public, the conclusion is that in 2019 Netflix’s content came from more recent years than from older years. Perhaps this is one reason Netflix has so many subscribers: people are able to watch recent Movies and TV Shows whenever and wherever they want.
It’s very important to note that Netflix has content on their streaming platform from a ton of different countries. So what country had the most produced content on Netflix? Let’s find out!
After creating a geo-spatial chart in Tableau depicting the countries and their number of produced TV shows and movies, it was very apparent that the United States took the lead for the most produced movies AND TV shows. The Tableau chart shows a map of the world for Movies (on top) and TV Shows (on bottom).
How many different countries have their content on Netflix?
netflix_titles %>%
group_by(Country) %>%
count(Type, sort = TRUE)%>%
filter(!Country %in% NA)
## # A tibble: 131 x 3
## # Groups: Country [81]
## Country Type n
## <chr> <chr> <int>
## 1 United States Movie 2100
## 2 India Movie 883
## 3 United States TV Show 783
## 4 United Kingdom Movie 341
## 5 United Kingdom TV Show 236
## 6 Canada Movie 175
## 7 Japan TV Show 162
## 8 South Korea TV Show 152
## 9 France Movie 137
## 10 Spain Movie 119
## # … with 121 more rows
The output has 131 rows, but those are combinations of country and type; the “Groups: Country [81]” line shows that 81 distinct countries are represented. That is still definitely more than I thought. Perhaps this worldwide variety is one reason Netflix has so many subscribers.
Looking at the information in a plot gives a closer look at which countries came closest to America in terms of production:
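The number of rows in a country-by-type table and the number of distinct countries are two different things, which is why the tibble header above reports both 131 rows and 81 groups. A small sketch on made-up counts shows the difference:

```r
# Hypothetical country/type counts (made-up values).
by_country_type <- data.frame(
  Country = c("United States", "United States", "India", "Japan"),
  Type    = c("Movie", "TV Show", "Movie", "TV Show"),
  n       = c(10, 4, 6, 3),
  stringsAsFactors = FALSE
)

n_rows      <- nrow(by_country_type)                    # country/type pairs
n_countries <- length(unique(by_country_type$Country))  # distinct countries
print(c(rows = n_rows, countries = n_countries))
```

A country that has both Movies and TV Shows appears in two rows but is counted once.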
netflix_titles %>%
group_by(Country) %>%
count(Type, sort = TRUE) %>%
filter(!Country %in% NA) %>%
ggplot(aes(reorder(Country, n), n, fill = Type)) +
geom_col() +
coord_flip() +
facet_wrap(~Type, scales = "free_y")
This plot shows all of the countries and their number of productions for both TV Shows and Movies. To make it clearer and easier to read, I filtered it to the country and type combinations with more than 50 titles:
netflix_titles %>%
group_by(Country) %>%
count(Type, sort = TRUE) %>%
filter(n > 50) %>%
filter(!Country %in% NA) %>%
ggplot(aes(reorder(Country, n), n, fill = Type)) +
geom_col() +
coord_flip() +
facet_wrap(~Type, scales = "free_y")
So much easier to read! This chart enables viewers to see how much more production the US did in comparison to the other countries.
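The filter above keeps every country and type pair with more than 50 titles, which is not the same as keeping a fixed number of top countries. The sketch below contrasts the two approaches; the first four counts echo the Movie counts shown earlier, while Chile and Peru are made-up values for illustration.

```r
# Title counts per country (Chile and Peru are hypothetical values).
counts <- c("United States" = 2100, "India" = 883, "Canada" = 175,
            "Spain" = 119, "Chile" = 20, "Peru" = 8)

# Threshold filter: keep every country with more than 50 titles.
above_50 <- counts[counts > 50]

# Top-N filter: keep the N countries with the most titles (here N = 3).
top_3 <- head(sort(counts, decreasing = TRUE), 3)

print(names(above_50))
print(names(top_3))
```

A threshold lets the number of bars vary with the data, while a top-N cut always returns the same number of bars.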
This already confirms what I was looking for, that the United States has the highest number of TV Shows and Movies on Netflix, but I now want to dive deeper into the specifics of those American TV Shows and Movies.
For this portion of the project, I will look into the TV Shows and Movies produced in America that were in Netflix’s selection in 2019. I’ll start with the exact count of American TV Shows and Movies, then look into the genres for each, and lastly look at the ratings. I’ll then compare these findings to the overall Netflix findings from above.
netflix_titles %>%
filter(Country %in% ("United States")& Type %in% ("TV Show")) ->USATV
USATV %>% count()
## # A tibble: 1 x 1
## n
## <int>
## 1 783
There were 783 American produced TV Shows on the streaming platform. The total number of TV Shows on Netflix in 2019 was 2,410, and 783 is 32.5% of 2,410, meaning that 32.5% of the TV Show content on Netflix in 2019 was produced in America.
netflix_titles %>%
filter(Country %in% ("United States")& Type %in% ("Movie")) ->USAMOV
USAMOV%>%count()
## # A tibble: 1 x 1
## n
## <int>
## 1 2100
There were 2,100 American produced Movies on Netflix. The total number of Movies on Netflix in 2019 was 5,377, and 2,100 is 39.06% of 5,377, meaning that 39.06% of the Movie content on Netflix in 2019 was produced in America.
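The percentages in this section come from dividing the American count by the overall count. As a sketch of that arithmetic, here is a small helper applied to the counts reported above; `pct_share()` is a hypothetical function written for this example, not part of any package.

```r
# Hypothetical helper: percentage share of `part` in `total`, rounded.
pct_share <- function(part, total, digits = 2) {
  round(100 * part / total, digits)
}

tv_share    <- pct_share(783, 2410)   # American TV Shows / all TV Shows
movie_share <- pct_share(2100, 5377)  # American Movies / all Movies
print(c(TV = tv_share, Movies = movie_share))
```

Letting R do the rounding avoids small slips that are easy to make when working the numbers out by hand.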
Referencing back to the TV Show Genres on Netflix overall, it was shown that International TV Shows was the most popular genre of the entire selection. What’s the most popular genre for the American TV Shows?
ggplot(USATV, aes(Genre, fill=Genre))+geom_bar()+ coord_flip()
Very interesting! The most popular genre among the USA produced TV Shows on Netflix was Kids’ TV. How many TV Shows on Netflix as a whole were categorized as Kids’ TV?
netflix_titles %>%
filter(Type%in%"TV Show" & Genre%in% "Kids' TV") %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 359
Of the TV Shows selection as a whole, 359 shows were categorized under Kids’ TV. How many of these were American produced?
USATV %>%
filter(Type%in%"TV Show" & Genre%in% "Kids' TV") %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 163
163 TV Shows that were produced in the United States were categorized under Kids’ TV, meaning that 45.4% of the Kids’ TV genre was produced in the United States.
Referencing back to the Movies Genres on Netflix overall, it was shown that Dramas were the most popular genre of the entire selection. What’s the most popular genre for the American Movies?
ggplot(USAMOV, aes(Genre, fill=Genre))+geom_bar()+ coord_flip()
The most popular genre among the USA produced Movies was Documentaries, with Dramas a close second. How many Movies on Netflix as a whole were categorized as Documentaries in 2019?
netflix_titles %>%
filter(Type%in%"Movie" & Genre%in% "Documentaries") %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 751
Of the Movies selection as a whole, 751 Movies were categorized under Documentaries. How many of these were American produced?
USAMOV %>%
filter(Type%in%"Movie" & Genre%in% "Documentaries") %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 397
397 Movies that were produced in the United States were categorized as Documentaries. From this number, it can be said that 52.86% of the Documentaries Genre on Netflix in 2019 were produced in the United States.
The same steps will be applied to determine how much content in terms of the ratings are based off of American production. Looking back at the TV Show Ratings on Netflix overall, it was shown that TV-MA was the most popular rating of the entire TV Show selection. What’s the most popular rating for the American TV Shows?
ggplot(USATV, aes(Rating, fill=Rating))+geom_bar()+ coord_flip()
The most popular rating among the TV Shows produced in America was TV-MA. This doesn’t come as a surprise, since TV-MA was also the most common rating in the overall selection for both TV Shows and Movies. But what percentage of that overall selection was produced in America?
USATV %>%
filter(Type%in%"TV Show" & Rating%in% "TV-MA") %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 318
318 American TV Shows were rated TV-MA. What was the total number of TV Shows on Netflix that were rated TV-MA?
netflix_titles %>%
filter(Type%in%"TV Show" & Rating%in% "TV-MA") %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 1018
A huge number! 1,018 of the entire selection of TV Shows were rated TV-MA. Of those, 31.24% (318 of 1,018) were produced in America.
Similar to the TV Show Ratings on Netflix overall, it was shown that TV-MA was the most popular rating of the entire Movie selection too. What’s the most popular rating among the American Movies?
ggplot(USAMOV, aes(Rating, fill=Rating))+geom_bar()+ coord_flip()
No surprises here! The most popular rating among the Movies produced in America was TV-MA. What percentage of that overall selection was produced in America?
USAMOV %>%
filter(Type%in%"Movie" & Rating%in% "TV-MA") %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 610
610 American Movies were rated TV-MA. What was the total number of Movies on Netflix that were rated TV-MA?
netflix_titles %>%
filter(Type%in%"Movie" & Rating%in% "TV-MA") %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 1845
1,845 of the entire selection of Movies on Netflix were rated TV-MA. What does this mean for American content? 33.06% of the TV-MA rated Movies were produced in the United States.
What percentage of the recently released content on Netflix in 2019 was American produced?
USATV %>%
filter(ReleaseYear%in%"2019") %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 163
There were 163 American produced TV Shows on Netflix in 2019 that were also released in 2019. Looking back at the TV Shows overall, 656 shows were added to Netflix in 2019. That figure counts titles by the year they were added rather than the year they were released, so the comparison is only approximate, but on that basis about 24.85% of the new TV Show content on Netflix in 2019 came from America. What about Movies?
USAMOV %>%
filter(ReleaseYear%in%"2019") %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 224
There were 224 American produced Movies on Netflix in 2019 that were also released in 2019. Looking back at the Movies overall, 1,497 Movies were added to Netflix in 2019. As with TV Shows, that figure is based on the year added rather than the release year, so the comparison is approximate, but on that basis only about 14.96% of the new Movie content on Netflix in 2019 came from America.
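For a stricter version of this comparison, the numerator and the denominator should both use the same year column. The sketch below shows the idea on made-up rows (not the real data), filtering on release year for both counts.

```r
# Hypothetical titles (made-up values): country, release year, year added.
titles <- data.frame(
  Country     = c("United States", "India", "United States", "Japan", "United States"),
  ReleaseYear = c(2019, 2019, 2017, 2019, 2019),
  Added       = c(2019, 2019, 2019, 2018, 2020),
  stringsAsFactors = FALSE
)

# Use the same column (ReleaseYear) for both numerator and denominator.
released_2019 <- titles[titles$ReleaseYear == 2019, ]
us_released   <- sum(released_2019$Country == "United States")
share         <- round(100 * us_released / nrow(released_2019), 2)
print(share)
```

On the real data this would mean counting all titles with a 2019 release year as the denominator, rather than reusing the count of titles added in 2019.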
As I finish up my project, I’ll go back to the beginning of the analysis. At the beginning of my project, I stated my interest in America’s role on Netflix in terms of production quantity. Since America is Netflix’s home country, I wanted to look into the content that the United States produced. I looked into Netflix’s selection as a whole in terms of TV Show and Movie count and then dove into the most popular genre, ratings, and release year for both TV Shows and Movies.
After looking at the different countries Netflix had, it was evident that the United States had the most produced content on Netflix. In reference to the “Netflix in Different Countries” section of my project, it was very obvious that the United States had the highest number of TV Shows and Movies on Netflix, totaling 783 TV Shows and 2,100 Movies. I included the percentages that America took up when it came to most popular genre, ratings and release year for both TV Shows and Movies.
One big find was that of the total Movies selection, 751 Movies were categorized under Documentaries. The 397 American Documentary Movies made up 52.86% of the Documentaries genre on Netflix in 2019, so American content made up more than half of the Movie Documentary selection in 2019. As for the other categories, the USA percentage ranged from about 15% to 45%, with most of the categories falling in the 30% range. This is a large share for America, especially with all of the other countries to account for.
With further research and data, it would be interesting to view the data for Netflix in 2020 and see if there was a big difference between the years on the streaming service. Another interesting exploration would be analyzing the growth and decline of the sections I analyzed, such as genres, ratings, and release year, on Netflix over a number of years. That could reveal some of the factors that go into how Netflix adds or gets rid of certain pieces of content. Lastly, since I looked into the American produced content on Netflix, I would love to take it further and look into the reasoning behind why this could be. Is it due to the home court advantage? Price? Relevancy? It could be due to a ton of different factors, so it would make for an interesting analysis.