Big data is changing society, and firms increasingly want to build it into their processes. Applying analytics to big data promises to help businesses solve a range of data-dependent management and operational problems that affect both the enterprise and the IT teams that support it. Many potential users, however, are unsure what big data analytics entails or how to use it to their advantage. The resulting confusion raises questions about the right course of action, and this lack of understanding is a problem for businesses that want to boost profitability and gain a competitive edge. Such companies need guidance from credible sources in order to make the best choice. On the technical side, synthesizing data and uncovering the value hidden in large datasets makes it difficult to build highly scalable algorithms and structures. Possible innovations include new big data applications and algorithms that extract crucial hidden knowledge. This report illustrates one key aspect of big data analytics within an organization: the goal of statistical analysis is to find similarities or patterns in data that help decision-makers at all levels make informed choices.
The project’s dataset was obtained from Kaggle.com. It contains metadata for Netflix’s movies and TV shows, such as each title’s original release year, the date it was added to Netflix, and its cast and director. Combined with additional fields such as age rating, country of production, duration, and description, these give a comprehensive picture of each title. In essence, the 8,807 records in this collection represent the whole Netflix catalog.
Importing the data from Excel into R
## show_id type title director
## 1 s1 Movie Dick Johnson Is Dead Kirsten Johnson
## 2 s2 TV Show Blood & Water <NA>
## 3 s3 TV Show Ganglands Julien Leclercq
## 4 s4 TV Show Jailbirds New Orleans <NA>
## 5 s5 TV Show Kota Factory <NA>
## 6 s6 TV Show Midnight Mass Mike Flanagan
## cast
## 1 <NA>
## 2 Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Molaba, Dillon Windvogel, Natasha Thahane, Arno Greeff, Xolile Tshabalala, Getmore Sithole, Cindy Mahlangu, Ryle De Morny, Greteli Fincham, Sello Maake Ka-Ncube, Odwa Gwanya, Mekaila Mathys, Sandi Schultz, Duane Williams, Shamilla Miller, Patrick Mofokeng
## 3 Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabiha Akkari, Sofia Lesaffre, Salim Kechiouche, Noureddine Farihi, Geert Van Rampelberg, Bakary Diombera
## 4 <NA>
## 5 Mayur More, Jitendra Kumar, Ranjan Raj, Alam Khan, Ahsaas Channa, Revathi Pillai, Urvi Singh, Arun Kumar
## 6 Kate Siegel, Zach Gilford, Hamish Linklater, Henry Thomas, Kristin Lehman, Samantha Sloyan, Igby Rigney, Rahul Kohli, Annarah Cymone, Annabeth Gish, Alex Essoe, Rahul Abburi, Matt Biedel, Michael Trucco, Crystal Balint, Louis Oliver
## country date_added release_year rating duration
## 1 United States September 25, 2021 2020 PG-13 90 min
## 2 South Africa September 24, 2021 2021 TV-MA 2 Seasons
## 3 <NA> September 24, 2021 2021 TV-MA 1 Season
## 4 <NA> September 24, 2021 2021 TV-MA 1 Season
## 5 India September 24, 2021 2021 TV-MA 2 Seasons
## 6 <NA> September 24, 2021 2021 TV-MA 1 Season
## listed_in
## 1 Documentaries
## 2 International TV Shows, TV Dramas, TV Mysteries
## 3 Crime TV Shows, International TV Shows, TV Action & Adventure
## 4 Docuseries, Reality TV
## 5 International TV Shows, Romantic TV Shows, TV Comedies
## 6 TV Dramas, TV Horror, TV Mysteries
## description
## 1 As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical ways to help them both face the inevitable.
## 2 After crossing paths at a party, a Cape Town teen sets out to prove whether a private-school swimming star is her sister who was abducted at birth.
## 3 To protect his family from a powerful drug lord, skilled thief Mehdi and his expert team of robbers are pulled into a violent and deadly turf war.
## 4 Feuds, flirtations and toilet talk go down among the incarcerated women at the Orleans Justice Center in New Orleans on this gritty reality series.
## 5 In a city of coaching centers known to train India’s finest collegiate minds, an earnest but unexceptional student and his friends navigate campus life.
## 6 The arrival of a charismatic young priest brings glorious miracles, ominous mysteries and renewed religious fervor to a dying town desperate to believe.
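The call that produced this preview is not shown in the report. A minimal sketch of the import step, assuming the Kaggle file was saved as CSV under the hypothetical name `netflix_titles.csv` and that blank cells should be read as `NA`, could look like this:

```r
# Hypothetical file name: the report does not state where the data was saved.
csv_path <- "netflix_titles.csv"

# Read a CSV export of the dataset, turning empty cells into NA so that
# missing directors, casts, and countries appear as <NA> in the preview.
read_netflix <- function(path) {
  read.csv(path, na.strings = c("", "NA"), stringsAsFactors = FALSE)
}

# Only attempt the import when the file is actually present.
if (file.exists(csv_path)) {
  netflix <- read_netflix(csv_path)
  print(head(netflix))
}
```

Reading blank cells as `NA` at import time keeps the later missing-value counts consistent, since empty strings would otherwise not be counted by `is.na()`.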
library(psych)
str(netflix)
## 'data.frame': 8807 obs. of 12 variables:
## $ show_id : chr "s1" "s2" "s3" "s4" ...
## $ type : chr "Movie" "TV Show" "TV Show" "TV Show" ...
## $ title : chr "Dick Johnson Is Dead" "Blood & Water" "Ganglands" "Jailbirds New Orleans" ...
## $ director : chr "Kirsten Johnson" NA "Julien Leclercq" NA ...
## $ cast : chr NA "Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Molaba, Dillon Windvogel, Natasha Thahane, Arno Greeff, Xolile "| __truncated__ "Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabiha Akkari, Sofia Lesaffre, Salim Kechiouche, Noureddine Farihi, G"| __truncated__ NA ...
## $ country : chr "United States" "South Africa" NA NA ...
## $ date_added : chr "September 25, 2021" "September 24, 2021" "September 24, 2021" "September 24, 2021" ...
## $ release_year: int 2020 2021 2021 2021 2021 2021 2021 1993 2021 2021 ...
## $ rating : chr "PG-13" "TV-MA" "TV-MA" "TV-MA" ...
## $ duration : chr "90 min" "2 Seasons" "1 Season" "1 Season" ...
## $ listed_in : chr "Documentaries" "International TV Shows, TV Dramas, TV Mysteries" "Crime TV Shows, International TV Shows, TV Action & Adventure" "Docuseries, Reality TV" ...
## $ description : chr "As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical wa"| __truncated__ "After crossing paths at a party, a Cape Town teen sets out to prove whether a private-school swimming star is h"| __truncated__ "To protect his family from a powerful drug lord, skilled thief Mehdi and his expert team of robbers are pulled "| __truncated__ "Feuds, flirtations and toilet talk go down among the incarcerated women at the Orleans Justice Center in New Or"| __truncated__ ...
describe(netflix)
## vars n mean sd median trimmed mad min max range
## show_id* 1 8807 4404.00 2542.51 4404.0 4404.00 3264.69 1 8807 8806
## type* 2 8807 1.30 0.46 1.0 1.25 0.00 1 2 1
## title* 3 8807 4404.00 2542.51 4404.0 4404.00 3264.69 1 8807 8806
## director* 4 6173 2318.15 1307.98 2362.0 2328.84 1670.89 1 4528 4527
## cast* 5 7982 3846.82 2221.85 3838.0 3845.87 2866.61 1 7692 7691
## country* 6 7976 427.90 193.43 493.0 447.16 164.57 1 748 747
## date_added* 7 8797 899.55 497.26 914.0 901.28 649.38 1 1767 1766
## release_year 8 8807 2014.18 8.82 2017.0 2016.03 2.97 1925 2021 96
## rating* 9 8803 11.01 1.96 12.0 11.09 2.97 1 17 16
## duration* 10 8804 94.75 88.18 55.5 91.18 80.80 1 220 219
## listed_in* 11 8807 273.40 131.06 290.0 278.41 131.95 1 514 513
## description* 12 8807 4386.75 2532.81 4386.0 4386.36 3249.86 1 8775 8774
## skew kurtosis se
## show_id* 0.00 -1.20 27.09
## type* 0.85 -1.27 0.00
## title* 0.00 -1.20 27.09
## director* -0.06 -1.20 16.65
## cast* 0.00 -1.21 24.87
## country* -0.54 -0.98 2.17
## date_added* -0.03 -1.20 5.30
## release_year -3.45 16.22 0.09
## rating* -0.42 0.38 0.02
## duration* 0.25 -1.69 0.94
## listed_in* -0.31 -0.76 1.40
## description* 0.00 -1.20 26.99
The dataset contains 8,807 observations of the 12 variables listed
below, which describe the TV shows and movies:
show_id - Unique ID for every movie / TV show
type - Identifier: Movie or TV Show
title - Title of the movie / TV show
director - Director of the movie / show
cast - Actors involved in the movie / show
country - Country where the movie / show was produced
date_added - Date it was added to Netflix
release_year - Actual release year of the movie / show
rating - TV rating of the movie / show
duration - Total duration, in minutes or number of seasons
listed_in - Genre
description - The summary description
We can first purge the dataset of variables that are not useful; in our case, that is the show_id variable. The description variable won’t be used in the exploratory data analysis, but it could be used in further analysis to locate similar movies and TV series through text similarity, which is outside the scope of this study. Because the project’s main focus is the analytics of the Netflix database, several variables in the dataset must be transformed first. This data cleaning was carried out in RStudio, and the R code that was used is included below for curious readers to examine. Data cleaning removed unnecessary columns from the dataset and unified the data structure to enable further processing. The missing-value and erroneous variable order issues that came up when using Excel have also been fixed, as shown below.
We check if we have missing values in the dataset.
## variable missing.values.count
## 1 type 0
## 2 title 0
## 3 director 2634
## 4 cast 825
## 5 country 831
## 6 date_added 10
## 7 release_year 0
## 8 rating 4
## 9 duration 3
## 10 listed_in 0
## 11 description 0
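A table like the one above can be produced by counting `NA` entries per column. The helper name `count_missing` and the toy data frame are illustrative assumptions, not the report's own code:

```r
# Count missing values per column and return them in the same two-column
# layout as the table above (variable, missing.values.count).
count_missing <- function(df) {
  data.frame(
    variable = names(df),
    missing.values.count = colSums(is.na(df)),
    row.names = NULL
  )
}

# Toy data for illustration only.
toy <- data.frame(
  type   = c("Movie", "TV Show", NA),
  rating = c(NA, NA, "PG"),
  stringsAsFactors = FALSE
)
missing_summary <- count_missing(toy)
print(missing_summary)
```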
The output above shows that the variables director, cast, country, date_added, rating, and duration all have missing values. Since rating is a categorical variable with 14 levels, we can fill in its missing values approximately with the mode.
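Mode imputation for a categorical column can be sketched as follows. The function name `fill_with_mode` is hypothetical, and the example vector is made up for illustration:

```r
# Replace NA entries of a character vector with its most frequent value.
# table() ignores NA by default; on ties, which.max() picks the first
# value in table()'s sorted order.
fill_with_mode <- function(x) {
  counts <- table(x)
  mode_value <- names(counts)[which.max(counts)]
  x[is.na(x)] <- mode_value
  x
}

ratings <- c("TV-MA", "PG", "TV-MA", NA)
ratings_filled <- fill_with_mode(ratings)
print(ratings_filled)
```

With only a handful of missing ratings out of several thousand rows, this substitution barely changes the rating distribution.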
The date format of the date_added variable can be changed to make later manipulation simpler.
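One way to do this conversion, assuming the strings follow the "Month day, year" pattern seen in the preview and an English locale, is `lubridate::mdy()`. The vector below is a small stand-in for the real column:

```r
library(lubridate)

# Stand-in for netflix$date_added; the NA mimics a missing entry.
date_added_raw <- c("September 25, 2021", "September 24, 2021", NA)

# Parse month-day-year text into Date objects; NA stays NA.
date_added <- mdy(date_added_raw)

# Year and month are then easy to extract for grouping.
added_year  <- year(date_added)
added_month <- month(date_added)
```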
We won’t fill in the director, cast, country, and date_added variables for the time being because their missing values are difficult to approximate; we will remove the missing values at the point where it becomes necessary. We additionally remove duplicate rows from the dataset, where duplicates are identified by the title, country, type, and release_year variables.
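The de-duplication rule described above can be sketched with `dplyr::distinct()`. The toy rows are made up, and keeping the first occurrence of each duplicate is an assumption about the intended behaviour:

```r
library(dplyr)

# Toy data for illustration: the first two rows share the four key columns.
toy <- data.frame(
  title        = c("A", "A", "B"),
  country      = c("India", "India", "India"),
  type         = c("Movie", "Movie", "Movie"),
  release_year = c(2020, 2020, 2020),
  rating       = c("PG", "TV-14", "PG"),
  stringsAsFactors = FALSE
)

# Keep one row per (title, country, type, release_year) combination;
# .keep_all = TRUE retains the remaining columns of the first match.
deduped <- toy %>%
  distinct(title, country, type, release_year, .keep_all = TRUE)
```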
After finishing the data cleaning procedures, we can move on to data exploration.
library(dplyr)
library(plotly)
# Count the number of titles of each type (Movie vs. TV Show)
amount_by_type <- netflix %>%
  group_by(type) %>%
  summarise(count = n())
# Draw the counts as a two-colour pie chart
diagram1 <- plot_ly(amount_by_type, labels = ~type, values = ~count, type = 'pie',
  marker = list(colors = c("#bd3939", "#399ba3")))
# Add a title and hide the grid, zero lines, and tick labels
diagram1 <- diagram1 %>% layout(title = 'Amount Of Netflix Content By Type',
  xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
  yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
diagram1
As can be seen from the graph above, Netflix has more than twice as
many movies as TV shows.
Since many movies and TV shows are co-produced by several countries, the
country variable can hold a comma-separated list of countries. To
calculate the amount of content produced by each country accurately, we
must split these strings and count the occurrences of each country
separately.
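The splitting and counting step can be sketched with `tidyr::separate_rows()` and `dplyr::count()`. This is a sketch on made-up rows, not necessarily the pipeline the report used:

```r
library(dplyr)
library(tidyr)

# Toy data for illustration: one co-production, one single country, one NA.
toy <- data.frame(
  type    = c("Movie", "TV Show", "Movie"),
  country = c("United States, India", "Japan", NA),
  stringsAsFactors = FALSE
)

by_country <- toy %>%
  filter(!is.na(country)) %>%                 # drop titles without a country
  separate_rows(country, sep = ",\\s*") %>%   # one row per listed country
  count(country, type, name = "count")        # titles per country and type
```

A co-production is counted once for every country it lists, so the per-country totals add up to more than the number of titles.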
We can see that the United States is by far the leader when it comes to Netflix content. For countries like Japan, South Korea, and Taiwan, Netflix carries more TV shows than movies.
From the above, it is clear that the total volume of content increased significantly beginning in 2016. We also observe how quickly Netflix’s movie selection surpassed its TV show selection.
From the information shown above, it is clear that Netflix’s addition of content peaked in November 2019. Let’s see how the content is distributed throughout the various rating classes.
As observed from the graph above, Indian movies are the longest on average, at 127 minutes.
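An average like this requires converting duration strings such as "90 min" into numbers first. The sketch below shows one way to do that; the rows are made up for illustration and are not values from the dataset:

```r
library(dplyr)

# Made-up rows for illustration only.
toy <- data.frame(
  type     = c("Movie", "Movie", "Movie", "TV Show"),
  country  = c("India", "India", "Japan", "Japan"),
  duration = c("100 min", "120 min", "90 min", "2 Seasons"),
  stringsAsFactors = FALSE
)

avg_duration <- toy %>%
  filter(type == "Movie") %>%                                  # TV shows are measured in seasons
  mutate(minutes = as.numeric(sub(" min$", "", duration))) %>% # "100 min" -> 100
  group_by(country) %>%
  summarise(mean_minutes = mean(minutes)) %>%
  arrange(desc(mean_minutes))                                  # longest average first
```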
Top 20 directors by number of titles on Netflix. Directors tied at the cutoff are all kept, which is why the table lists 28 rows.
## Selecting by count
## director count
## 1 Rajiv Chilaka 22
## 2 Jan Suter 21
## 3 Raúl Campos 19
## 4 Marcus Raboy 16
## 5 Suhas Kadav 16
## 6 Jay Karas 15
## 7 Cathy Garcia-Molina 13
## 8 Jay Chapman 12
## 9 Martin Scorsese 12
## 10 Youssef Chahine 12
## 11 Steven Spielberg 11
## 12 Don Michael Paul 10
## 13 Anurag Kashyap 9
## 14 David Dhawan 9
## 15 Shannon Hartman 9
## 16 Yılmaz Erdoğan 9
## 17 Fernando Ayllón 8
## 18 Hakan Algül 8
## 19 Hanung Bramantyo 8
## 20 Johnnie To 8
## 21 Justin G. Dyck 8
## 22 Kunle Afolayan 8
## 23 Lance Bangs 8
## 24 Quentin Tarantino 8
## 25 Robert Rodriguez 8
## 26 Ryan Polito 8
## 27 Troy Miller 8
## 28 Umesh Mehra 8
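The "Selecting by count" message suggests that the ranking was produced with `dplyr::top_n()`. A comparable sketch using `slice_max()`, which supersedes `top_n()` in current dplyr, shows how ties at the cutoff enlarge the result. The director names here are placeholders:

```r
library(dplyr)
library(tidyr)

# Placeholder data: some titles list several directors, one has none.
toy <- data.frame(
  director = c("A, B", "A", "C", NA, "B", "D"),
  stringsAsFactors = FALSE
)

top_directors <- toy %>%
  filter(!is.na(director)) %>%
  separate_rows(director, sep = ",\\s*") %>%     # one row per director
  count(director, name = "count") %>%            # titles per director
  slice_max(count, n = 3, with_ties = TRUE)      # top 3, keeping ties
```

Here A and B each have two titles while C and D each have one, so asking for the top 3 returns all four directors.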
The top 20 actors in American-made movies on Netflix by number of titles (ties at the cutoff are kept, so 22 rows are listed):
## Selecting by count
## actor count
## 1 Adam Sandler 20
## 2 Samuel L. Jackson 19
## 3 Fred Tatasciore 17
## 4 Molly Shannon 15
## 5 Seth Rogen 15
## 6 Chris Rock 14
## 7 Erin Fitzgerald 14
## 8 Laura Bailey 14
## 9 Morgan Freeman 14
## 10 Nicolas Cage 14
## 11 Dennis Quaid 13
## 12 James Franco 13
## 13 Woody Harrelson 13
## 14 Danny Trejo 12
## 15 David Koechner 12
## 16 Fred Armisen 12
## 17 Kate Higgins 12
## 18 Keith David 12
## 19 Mike Epps 12
## 20 Nick Swardson 12
## 21 Sean Astin 12
## 22 Will Smith 12