
Task 1

Data preparation process

Big data is changing society, and firms increasingly want to build it into their processes. Applying analytics to big data can help businesses solve a range of data-dependent management and operational problems that affect both the enterprise and the IT teams that support it. Many potential users, however, are unsure what big data analytics entails or how to use it to their advantage. The resulting confusion raises questions about the right course of action, and this lack of understanding is a problem for businesses that want to increase profitability and gain a competitive edge. Companies therefore need guidance from credible sources in order to make sound choices. Big data analytics also poses a considerable technical challenge: synthesizing the data and uncovering the value hidden in large datasets requires highly scalable algorithms and architectures. Possible innovations include new big data applications and algorithms that extract important hidden knowledge. This report illustrates one aspect of big data analytics within an organization. The goal of the statistical analysis is to find similarities or patterns in the data that help decision-makers at all levels make informed choices.

Data collection

The project’s dataset was obtained from Kaggle.com. It contains metadata for the movies and TV shows on Netflix: each title’s original release year, the date it was added to Netflix, and its cast and director. Together with additional fields such as age rating, country of production, duration, and description, these give a comprehensive picture of each title. The 8,807 records in this collection are taken here to represent the Netflix catalogue as a whole.

Importing the data from Excel into R

##   show_id    type                 title        director
## 1      s1   Movie  Dick Johnson Is Dead Kirsten Johnson
## 2      s2 TV Show         Blood & Water            <NA>
## 3      s3 TV Show             Ganglands Julien Leclercq
## 4      s4 TV Show Jailbirds New Orleans            <NA>
## 5      s5 TV Show          Kota Factory            <NA>
## 6      s6 TV Show         Midnight Mass   Mike Flanagan
##                                                                                                                                                                                                                                                                                                              cast
## 1                                                                                                                                                                                                                                                                                                            <NA>
## 2 Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Molaba, Dillon Windvogel, Natasha Thahane, Arno Greeff, Xolile Tshabalala, Getmore Sithole, Cindy Mahlangu, Ryle De Morny, Greteli Fincham, Sello Maake Ka-Ncube, Odwa Gwanya, Mekaila Mathys, Sandi Schultz, Duane Williams, Shamilla Miller, Patrick Mofokeng
## 3                                                                                                                                                             Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabiha Akkari, Sofia Lesaffre, Salim Kechiouche, Noureddine Farihi, Geert Van Rampelberg, Bakary Diombera
## 4                                                                                                                                                                                                                                                                                                            <NA>
## 5                                                                                                                                                                                                        Mayur More, Jitendra Kumar, Ranjan Raj, Alam Khan, Ahsaas Channa, Revathi Pillai, Urvi Singh, Arun Kumar
## 6                                                                        Kate Siegel, Zach Gilford, Hamish Linklater, Henry Thomas, Kristin Lehman, Samantha Sloyan, Igby Rigney, Rahul Kohli, Annarah Cymone, Annabeth Gish, Alex Essoe, Rahul Abburi, Matt Biedel, Michael Trucco, Crystal Balint, Louis Oliver
##         country         date_added release_year rating  duration
## 1 United States September 25, 2021         2020  PG-13    90 min
## 2  South Africa September 24, 2021         2021  TV-MA 2 Seasons
## 3          <NA> September 24, 2021         2021  TV-MA  1 Season
## 4          <NA> September 24, 2021         2021  TV-MA  1 Season
## 5         India September 24, 2021         2021  TV-MA 2 Seasons
## 6          <NA> September 24, 2021         2021  TV-MA  1 Season
##                                                       listed_in
## 1                                                 Documentaries
## 2               International TV Shows, TV Dramas, TV Mysteries
## 3 Crime TV Shows, International TV Shows, TV Action & Adventure
## 4                                        Docuseries, Reality TV
## 5        International TV Shows, Romantic TV Shows, TV Comedies
## 6                            TV Dramas, TV Horror, TV Mysteries
##                                                                                                                                                description
## 1 As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical ways to help them both face the inevitable.
## 2      After crossing paths at a party, a Cape Town teen sets out to prove whether a private-school swimming star is her sister who was abducted at birth.
## 3       To protect his family from a powerful drug lord, skilled thief Mehdi and his expert team of robbers are pulled into a violent and deadly turf war.
## 4      Feuds, flirtations and toilet talk go down among the incarcerated women at the Orleans Justice Center in New Orleans on this gritty reality series.
## 5 In a city of coaching centers known to train India’s finest collegiate minds, an earnest but unexceptional student and his friends navigate campus life.
## 6 The arrival of a charismatic young priest brings glorious miracles, ominous mysteries and renewed religious fervor to a dying town desperate to believe.
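
A minimal sketch of the import step, assuming the Kaggle file was saved as CSV. The file path, the two-row sample, and the choice to read empty cells as NA are assumptions made for illustration; the report does not show the import call that was actually used.

```r
# Write a tiny stand-in for the Kaggle export so the example runs on its own.
# In the project, `csv_path` (a hypothetical name) would point at the real file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "show_id,type,title,director,release_year",
  "s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,2020",
  "s2,TV Show,Blood & Water,,2021"
), csv_path)

# Read empty cells as NA so that missing directors, casts, etc. can be counted.
netflix <- read.csv(csv_path, na.strings = c("", "NA"), stringsAsFactors = FALSE)

# Show the first rows, as in the output above.
head(netflix)
```

Reading empty strings as NA at import time means the later missing-value checks can rely on `is.na()` alone.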

Descriptive analysis

library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
str(netflix)
## 'data.frame':    8807 obs. of  12 variables:
##  $ show_id     : chr  "s1" "s2" "s3" "s4" ...
##  $ type        : chr  "Movie" "TV Show" "TV Show" "TV Show" ...
##  $ title       : chr  "Dick Johnson Is Dead" "Blood & Water" "Ganglands" "Jailbirds New Orleans" ...
##  $ director    : chr  "Kirsten Johnson" NA "Julien Leclercq" NA ...
##  $ cast        : chr  NA "Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Molaba, Dillon Windvogel, Natasha Thahane, Arno Greeff, Xolile "| __truncated__ "Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabiha Akkari, Sofia Lesaffre, Salim Kechiouche, Noureddine Farihi, G"| __truncated__ NA ...
##  $ country     : chr  "United States" "South Africa" NA NA ...
##  $ date_added  : chr  "September 25, 2021" "September 24, 2021" "September 24, 2021" "September 24, 2021" ...
##  $ release_year: int  2020 2021 2021 2021 2021 2021 2021 1993 2021 2021 ...
##  $ rating      : chr  "PG-13" "TV-MA" "TV-MA" "TV-MA" ...
##  $ duration    : chr  "90 min" "2 Seasons" "1 Season" "1 Season" ...
##  $ listed_in   : chr  "Documentaries" "International TV Shows, TV Dramas, TV Mysteries" "Crime TV Shows, International TV Shows, TV Action & Adventure" "Docuseries, Reality TV" ...
##  $ description : chr  "As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical wa"| __truncated__ "After crossing paths at a party, a Cape Town teen sets out to prove whether a private-school swimming star is h"| __truncated__ "To protect his family from a powerful drug lord, skilled thief Mehdi and his expert team of robbers are pulled "| __truncated__ "Feuds, flirtations and toilet talk go down among the incarcerated women at the Orleans Justice Center in New Or"| __truncated__ ...
describe(netflix)
##              vars    n    mean      sd median trimmed     mad  min  max range
## show_id*        1 8807 4404.00 2542.51 4404.0 4404.00 3264.69    1 8807  8806
## type*           2 8807    1.30    0.46    1.0    1.25    0.00    1    2     1
## title*          3 8807 4404.00 2542.51 4404.0 4404.00 3264.69    1 8807  8806
## director*       4 6173 2318.15 1307.98 2362.0 2328.84 1670.89    1 4528  4527
## cast*           5 7982 3846.82 2221.85 3838.0 3845.87 2866.61    1 7692  7691
## country*        6 7976  427.90  193.43  493.0  447.16  164.57    1  748   747
## date_added*     7 8797  899.55  497.26  914.0  901.28  649.38    1 1767  1766
## release_year    8 8807 2014.18    8.82 2017.0 2016.03    2.97 1925 2021    96
## rating*         9 8803   11.01    1.96   12.0   11.09    2.97    1   17    16
## duration*      10 8804   94.75   88.18   55.5   91.18   80.80    1  220   219
## listed_in*     11 8807  273.40  131.06  290.0  278.41  131.95    1  514   513
## description*   12 8807 4386.75 2532.81 4386.0 4386.36 3249.86    1 8775  8774
##               skew kurtosis    se
## show_id*      0.00    -1.20 27.09
## type*         0.85    -1.27  0.00
## title*        0.00    -1.20 27.09
## director*    -0.06    -1.20 16.65
## cast*         0.00    -1.21 24.87
## country*     -0.54    -0.98  2.17
## date_added*  -0.03    -1.20  5.30
## release_year -3.45    16.22  0.09
## rating*      -0.42     0.38  0.02
## duration*     0.25    -1.69  0.94
## listed_in*   -0.31    -0.76  1.40
## description*  0.00    -1.20 26.99

The dataset contains 8,807 observations of 12 variables that describe the movies and TV shows. Only release_year is numeric (1925 to 2021, median 2017); the rows marked with * in the describe() output are character variables that were converted to integer codes, so their means and standard deviations are not meaningful. The variables are:
show_id - Unique ID for every movie / TV show
type - Identifier: Movie or TV Show
title - Title of the movie / TV show
director - Director of the movie / TV show
cast - Actors involved in the movie / TV show
country - Country where the movie / TV show was produced
date_added - Date it was added to Netflix
release_year - Original release year of the movie / TV show
rating - TV rating of the movie / TV show
duration - Total duration, in minutes or number of seasons
listed_in - Genre
description - Summary description

Data cleaning

We first drop variables that are not useful, which in our case is the show_id variable. The description variable is not used in the exploratory data analysis; it could be used to find similar movies and TV shows through text similarity, but that is outside the scope of this study. Because the project focuses on analytics of the Netflix catalogue, several other variables must be transformed before they can be analysed. The data cleaning was carried out in RStudio, and the R code used is included for readers who want to examine it. Cleaning removed the unnecessary columns and unified the data structure for later processing. The missing-value and variable-order problems that arose when working with the data in Excel were also fixed, as described below.

We check if we have missing values in the dataset.

##        variable missing.values.count
## 1          type                    0
## 2         title                    0
## 3      director                 2634
## 4          cast                  825
## 5       country                  831
## 6    date_added                   10
## 7  release_year                    0
## 8        rating                    4
## 9      duration                    3
## 10    listed_in                    0
## 11  description                    0
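
A table like the one above can be produced along the following lines. This is a sketch on a three-row toy data frame, not the report's own code; only the output column names are taken from the table.

```r
# Toy stand-in for the cleaned data frame (values are illustrative only).
netflix <- data.frame(
  type     = c("Movie", "TV Show", "TV Show"),
  director = c("Kirsten Johnson", NA, NA),
  rating   = c("PG-13", "TV-MA", NA),
  stringsAsFactors = FALSE
)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column.
missing_counts <- data.frame(
  variable = names(netflix),
  missing.values.count = colSums(is.na(netflix)),
  row.names = NULL,
  stringsAsFactors = FALSE
)

missing_counts
```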

The output above shows that director, cast, country, date_added, rating, and duration all have missing values. Because rating is a categorical variable with 14 levels, we can approximate its four missing values with the mode, that is, the most frequent rating.
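
Mode imputation for a categorical variable can be sketched as follows. The vector is a toy example and the names `rating_mode` and `rating_filled` are hypothetical.

```r
# Toy rating vector with two missing values.
rating <- c("TV-MA", "PG-13", NA, "TV-MA", "TV-14", NA)

# table() ignores NA by default; which.max() returns the first level
# with the highest count, so ties are broken by table order.
rating_mode <- names(which.max(table(rating)))

# Replace only the missing entries with the mode.
rating_filled <- ifelse(is.na(rating), rating_mode, rating)

rating_filled
```

With only a handful of missing ratings, filling them with the mode changes the distribution very little; for variables with many missing values this would be a much stronger assumption.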

The date_added variable is stored as text (for example, "September 25, 2021"); converting it to a date type makes later manipulation simpler.
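
One way to do the conversion in base R is sketched below. It assumes the "Month day, year" format seen in the data preview, and the `trimws()` call is a precaution against stray whitespace rather than something the report states is needed.

```r
# Toy values in the format shown in the data preview; one has a leading space.
date_added <- c("September 25, 2021", " September 24, 2021", NA)

# %B matches full month names in the current locale, so switch LC_TIME to "C"
# (English month names) while parsing and restore the old setting afterwards.
old_locale <- Sys.getlocale("LC_TIME")
invisible(Sys.setlocale("LC_TIME", "C"))
date_parsed <- as.Date(trimws(date_added), format = "%B %d, %Y")
invisible(Sys.setlocale("LC_TIME", old_locale))

# Year and month are convenient for the grouping steps later on.
year_added  <- as.integer(format(date_parsed, "%Y"))
month_added <- as.integer(format(date_parsed, "%m"))

date_parsed
```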

For the time being we do not fill in the director, cast, country, and date_added variables, because their missing values are hard to approximate. We remove the affected rows only in the analyses that need those variables. We also remove duplicate rows, defined as rows that share the same title, country, type, and release_year.
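
The duplicate removal can be sketched with base R's `duplicated()`, applied to the four key columns only. The data frame is a toy example and `key_cols` is a hypothetical name.

```r
# Toy data: rows 1 and 2 agree on all four key columns.
netflix <- data.frame(
  title        = c("Ganglands", "Ganglands", "Kota Factory"),
  country      = c(NA, NA, "India"),
  type         = c("TV Show", "TV Show", "TV Show"),
  release_year = c(2021L, 2021L, 2021L),
  rating       = c("TV-MA", "TV-MA", "TV-MA"),
  stringsAsFactors = FALSE
)

key_cols <- c("title", "country", "type", "release_year")

# duplicated() flags every repeat after the first occurrence, so the
# first row of each group is kept together with all of its columns.
netflix_unique <- netflix[!duplicated(netflix[, key_cols]), ]

nrow(netflix_unique)
```

With dplyr, `distinct(netflix, title, country, type, release_year, .keep_all = TRUE)` expresses the same idea.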

After finishing the data cleaning procedures, we can move on to data exploration.

Task 3

Data visualization

library(dplyr)
library(plotly)
# Count the number of titles per content type (Movie vs. TV Show)
amount_by_type <- netflix %>%
  group_by(type) %>%
  summarise(count = n())
# Pie chart of the counts, one fixed colour per type
diagram1 <- plot_ly(amount_by_type, labels = ~type, values = ~count, type = 'pie',
                    marker = list(colors = c("#bd3939", "#399ba3")))
# Add a title and hide grid lines, zero lines, and tick labels
diagram1 <- diagram1 %>% layout(title = 'Amount Of Netflix Content By Type',
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
diagram1

As the chart above shows, Netflix has more than twice as many movies as TV shows.
Many movies and TV shows are produced by more than one country, so the country variable can hold several country names in a single string. To count the content produced by each country accurately, we split these strings and count the occurrences of each country separately.
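
The splitting and counting step can be sketched in base R as follows. The example assumes that co-producing countries are separated by commas; the data and the names `country_long` and `country_counts` are illustrative.

```r
# Toy data: the first title is a co-production of two countries.
netflix <- data.frame(
  type    = c("Movie", "Movie", "TV Show", "TV Show"),
  country = c("United States, India", "India", "Japan", NA),
  stringsAsFactors = FALSE
)

# Keep only rows with a known country, then split on commas.
has_country  <- !is.na(netflix$country)
country_list <- strsplit(netflix$country[has_country], ",")

# One row per (country, title) pair; repeat the type once per split country.
country_long <- data.frame(
  country = trimws(unlist(country_list)),
  type    = rep(netflix$type[has_country], lengths(country_list)),
  stringsAsFactors = FALSE
)

# Drop empty fragments that a leading or trailing comma would leave behind.
country_long <- country_long[country_long$country != "", ]

# Cross-tabulate countries against content type.
country_counts <- table(country = country_long$country, type = country_long$type)

country_counts
```

A co-production is counted once for every country involved, so the per-country totals add up to more than the number of titles.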


The United States is by far the leading producer of Netflix content. For countries such as Japan, South Korea, and Taiwan, Netflix carries more TV shows than movies.


The chart above shows that the total volume of content grew sharply from 2016 onwards. It also shows how quickly Netflix’s movie selection overtook its TV show selection.
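
The counts behind a chart of this kind can be sketched as below. The report does not say whether the chart showed yearly additions or a running total, so both are computed; the data are toy values and the variable names are hypothetical.

```r
# Toy data with parsed dates; one title has no date_added.
netflix <- data.frame(
  type       = c("Movie", "Movie", "TV Show", "Movie", "TV Show"),
  date_added = as.Date(c("2015-03-01", "2016-07-15", "2016-11-02", "2017-01-20", NA)),
  stringsAsFactors = FALSE
)

# Extract the year and drop titles without a date.
netflix$year_added <- as.integer(format(netflix$date_added, "%Y"))
added <- netflix[!is.na(netflix$year_added), ]

# Titles added per year and type.
per_year <- table(year = added$year_added, type = added$type)

# Running total per type (cumulative sum down each column).
cumulative <- apply(per_year, 2, cumsum)

cumulative
```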

The chart above shows that Netflix’s content additions peaked in November 2019. Next, we look at how the content is distributed across the rating classes.
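
Finding the month with the most additions can be sketched as follows, assuming date_added has already been parsed to a Date. The values are toy data chosen for illustration and do not reproduce the real counts.

```r
# Toy dates; the NA stands for a title with no date_added.
date_added <- as.Date(c("2019-11-01", "2019-11-15", "2019-12-01", "2020-01-01", NA))

# Collapse each date to a "YYYY-MM" label and count titles per month.
added_month    <- format(date_added, "%Y-%m")
monthly_counts <- table(added_month)

# which.max() returns the first month with the highest count.
peak_month <- names(which.max(monthly_counts))

peak_month
```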


As the graph above shows, Indian movies are the longest on average, at 127 minutes.
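
An average movie length per country can be computed along these lines. The sketch uses toy values (so the result is not the 127 minutes reported above) and, for brevity, treats each movie as belonging to a single country; co-productions would first need the country split described earlier.

```r
# Toy data; durations follow the "<n> min" / "<n> Seasons" format of the dataset.
netflix <- data.frame(
  type     = c("Movie", "Movie", "Movie", "TV Show"),
  country  = c("India", "India", "United States", "India"),
  duration = c("110 min", "130 min", "90 min", "2 Seasons"),
  stringsAsFactors = FALSE
)

# Only movies have a duration in minutes.
movies <- netflix[netflix$type == "Movie" & !is.na(netflix$duration), ]

# Strip the " min" suffix and convert to a number.
movies$minutes <- as.numeric(sub(" min$", "", movies$duration))

# Mean duration per country, longest first.
avg_duration <- aggregate(minutes ~ country, data = movies, FUN = mean)
avg_duration <- avg_duration[order(-avg_duration$minutes), ]

avg_duration
```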

Top 20 directors by number of titles on Netflix. Ties at the cut-off of eight titles are kept, so 28 directors are listed.

## Selecting by count
##               director count
## 1        Rajiv Chilaka    22
## 2            Jan Suter    21
## 3          Raúl Campos    19
## 4         Marcus Raboy    16
## 5          Suhas Kadav    16
## 6            Jay Karas    15
## 7  Cathy Garcia-Molina    13
## 8          Jay Chapman    12
## 9      Martin Scorsese    12
## 10     Youssef Chahine    12
## 11    Steven Spielberg    11
## 12    Don Michael Paul    10
## 13      Anurag Kashyap     9
## 14        David Dhawan     9
## 15     Shannon Hartman     9
## 16      Yılmaz Erdoğan     9
## 17     Fernando Ayllón     8
## 18         Hakan Algül     8
## 19    Hanung Bramantyo     8
## 20          Johnnie To     8
## 21      Justin G. Dyck     8
## 22      Kunle Afolayan     8
## 23         Lance Bangs     8
## 24   Quentin Tarantino     8
## 25    Robert Rodriguez     8
## 26         Ryan Polito     8
## 27         Troy Miller     8
## 28         Umesh Mehra     8
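
The table has more than 20 rows because every director tied with the 20th place is kept, which is consistent with how dplyr's `top_n()` treats ties. The base R sketch below reproduces that behaviour on toy counts; the names `n_top` and `cutoff` are hypothetical.

```r
# Toy counts: C and D are tied for third place.
director_counts <- data.frame(
  director = c("A", "B", "C", "D", "E"),
  count    = c(5, 4, 3, 3, 1),
  stringsAsFactors = FALSE
)

n_top <- 3

# The cut-off is the n-th largest count; every row at or above it is kept,
# so ties at the cut-off make the result longer than n_top.
sorted_counts <- sort(director_counts$count, decreasing = TRUE)
cutoff <- sorted_counts[min(n_top, length(sorted_counts))]
top_directors <- director_counts[director_counts$count >= cutoff, ]

# Order by count (descending), then alphabetically, as in the table above.
top_directors <- top_directors[order(-top_directors$count, top_directors$director), ]

top_directors
```

If exactly 20 rows are wanted, the ties have to be broken explicitly, for example by sorting and taking the first 20 rows with `head()`.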

The top 20 actors in American-made movies on Netflix, by number of titles. Ties at the cut-off of 12 titles are kept, so 22 actors are listed:

## Selecting by count
##                actor count
## 1       Adam Sandler    20
## 2  Samuel L. Jackson    19
## 3    Fred Tatasciore    17
## 4      Molly Shannon    15
## 5         Seth Rogen    15
## 6         Chris Rock    14
## 7    Erin Fitzgerald    14
## 8       Laura Bailey    14
## 9     Morgan Freeman    14
## 10      Nicolas Cage    14
## 11      Dennis Quaid    13
## 12      James Franco    13
## 13   Woody Harrelson    13
## 14       Danny Trejo    12
## 15    David Koechner    12
## 16      Fred Armisen    12
## 17      Kate Higgins    12
## 18       Keith David    12
## 19         Mike Epps    12
## 20     Nick Swardson    12
## 21        Sean Astin    12
## 22        Will Smith    12