This dataset describes movies available on Netflix: the title, cast, description, duration, IMDb rating, number of votes, year, genre, and certificate of each one. The data comes from the IMDb website and was collected by web scraping.

Input Data

Make sure the data file is placed in the same folder as the R project.

movie <- read.csv("n_movies.csv")
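A slightly more defensive read can be sketched as follows. This is only an illustration: the temporary file and its two rows are made up so the snippet runs on its own, and `movie_demo` is a hypothetical name. With the real data, the path would be "n_movies.csv".

```r
# Hypothetical two-row stand-in for n_movies.csv, written to a temp file
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  'title,year,rating,votes',
  'Cobra Kai,(2018- ),8.5,"177,031"',
  'Untitled,,,'
), csv_path)

# Stop early with a clear error if the file is not where we expect it
stopifnot(file.exists(csv_path))

# Treat empty fields as NA and keep text columns as character
movie_demo <- read.csv(csv_path,
                       stringsAsFactors = FALSE,
                       na.strings = c("", "NA"))
str(movie_demo)
```

Marking empty fields as NA at read time makes missing values visible in later steps such as `summary()`.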

Data Inspection

After reading the data, we inspect a sample of rows to check that it loaded properly.

Check the top rows

head(movie)
##                    title        year certificate duration
## 1              Cobra Kai    (2018– )       TV-14   30 min
## 2              The Crown    (2016– )       TV-MA   58 min
## 3       Better Call Saul (2015–2022)       TV-MA   46 min
## 4          Devil in Ohio      (2022)       TV-MA  356 min
## 5 Cyberpunk: Edgerunners    (2022– )       TV-MA   24 min
## 6            The Sandman    (2022– )       TV-MA   45 min
##                          genre rating
## 1        Action, Comedy, Drama    8.5
## 2    Biography, Drama, History    8.7
## 3                 Crime, Drama    8.9
## 4       Drama, Horror, Mystery    5.9
## 5 Animation, Action, Adventure    8.6
## 6       Drama, Fantasy, Horror    7.8
##                                                                                                                                                                                                                      description
## 1                                                                        Decades after their 1984 All Valley Karate Tournament bout, a middle-aged Daniel LaRusso and Johnny Lawrence find themselves martial-arts rivals again.
## 2                                                                                 Follows the political rivalries and romance of Queen Elizabeth II's reign and the events that shaped the second half of the twentieth century.
## 3                                                                                                     The trials and tribulations of criminal lawyer Jimmy McGill before his fateful run-in with Walter White and Jesse Pinkman.
## 4                                                                          When a psychiatrist shelters a mysterious cult escapee, her world is turned upside down as the girl's arrival threatens to tear her own family apart.
## 5 A Street Kid trying to survive in a technology and body modification-obsessed city of the future. Having everything to lose, he chooses to stay alive by becoming an Edgerunner, a Mercenary outlaw also known as a Cyberpunk.
## 6                                                                                Upon escaping after decades of imprisonment by a mortal wizard, Dream, the personification of dreams, sets about to reclaim his lost equipment.
##                                                                              stars
## 1 ['Ralph Macchio, ', 'William Zabka, ', 'Courtney Henggeler, ', 'Xolo Maridueña']
## 2           ['Claire Foy, ', 'Olivia Colman, ', 'Imelda Staunton, ', 'Matt Smith']
## 3       ['Bob Odenkirk, ', 'Rhea Seehorn, ', 'Jonathan Banks, ', 'Patrick Fabian']
## 4   ['Emily Deschanel, ', 'Sam Jaeger, ', 'Gerardo Celasco, ', 'Madeleine Arthur']
## 5                 ['Zach Aguilar, ', 'Kenichiro Ohashi, ', 'Emi Lo, ', 'Aoi Yûki']
## 6 ['Tom Sturridge, ', 'Boyd Holbrook, ', 'Patton Oswalt, ', 'Vivienne Acheampong']
##     votes
## 1 177,031
## 2 199,885
## 3 501,384
## 4   9,773
## 5  15,413
## 6 116,358

Check the bottom rows

tail(movie)
##                 title        year certificate duration
## 9952     Breaking Bad (2008–2013)       TV-MA   49 min
## 9953   The Imperfects    (2022– )       TV-MA   45 min
## 9954 The Walking Dead (2010–2022)       TV-MA   44 min
## 9955        The Crown    (2016– )       TV-MA   58 min
## 9956     Supernatural (2005–2020)       TV-14   44 min
## 9957    Devil in Ohio      (2022)       TV-MA  356 min
##                          genre rating
## 9952    Crime, Drama, Thriller    9.5
## 9953  Action, Adventure, Drama    6.3
## 9954   Drama, Horror, Thriller    8.1
## 9955 Biography, Drama, History    8.7
## 9956    Drama, Fantasy, Horror    8.4
## 9957    Drama, Horror, Mystery    5.9
##                                                                                                                                                                            description
## 9952                  A high school chemistry teacher diagnosed with inoperable lung cancer turns to manufacturing and selling methamphetamine in order to secure his family's future.
## 9953 After an experimental gene therapy turns them into monsters, three twenty-somethings band together to hunt down the scientist responsible and force him to make them human again.
## 9954                                                  Sheriff Deputy Rick Grimes wakes up from a coma to learn the world is in ruins and must lead a group of survivors to stay alive.
## 9955                                    Follows the political rivalries and romance of Queen Elizabeth II's reign and the events that shaped the second half of the twentieth century.
## 9956                Two brothers follow their father's footsteps as hunters, fighting evil supernatural beings of many kinds, including monsters, demons and gods that roam the earth.
## 9957                             When a psychiatrist shelters a mysterious cult escapee, her world is turned upside down as the girl's arrival threatens to tear her own family apart.
##                                                                                  stars
## 9952               ['Bryan Cranston, ', 'Aaron Paul, ', 'Anna Gunn, ', 'Betsy Brandt']
## 9953 ['Morgan Taylor Campbell, ', 'Italia Ricci, ', 'Rhianna Jagpal, ', 'Iñaki Godoy']
## 9954      ['Andrew Lincoln, ', 'Norman Reedus, ', 'Melissa McBride, ', 'Lauren Cohan']
## 9955            ['Claire Foy, ', 'Olivia Colman, ', 'Imelda Staunton, ', 'Matt Smith']
## 9956         ['Jared Padalecki, ', 'Jensen Ackles, ', 'Jim Beaver, ', 'Misha Collins']
## 9957    ['Emily Deschanel, ', 'Sam Jaeger, ', 'Gerardo Celasco, ', 'Madeleine Arthur']
##          votes
## 9952 1,831,359
## 9953     3,130
## 9954   970,067
## 9955   199,898
## 9956   439,601
## 9957     9,786

Check the data dimensions

dim(movie)
## [1] 9957    9

Check the list of column names

names(movie)
## [1] "title"       "year"        "certificate" "duration"    "genre"      
## [6] "rating"      "description" "stars"       "votes"

Data Cleaning and Coercion

We need to understand the content of the data so we can structure it.

Data structure

str(movie)
## 'data.frame':    9957 obs. of  9 variables:
##  $ title      : chr  "Cobra Kai" "The Crown" "Better Call Saul" "Devil in Ohio" ...
##  $ year       : chr  "(2018– )" "(2016– )" "(2015–2022)" "(2022)" ...
##  $ certificate: chr  "TV-14" "TV-MA" "TV-MA" "TV-MA" ...
##  $ duration   : chr  "30 min" "58 min" "46 min" "356 min" ...
##  $ genre      : chr  "Action, Comedy, Drama" "Biography, Drama, History" "Crime, Drama" "Drama, Horror, Mystery" ...
##  $ rating     : num  8.5 8.7 8.9 5.9 8.6 7.8 9.2 9.5 6.3 6.2 ...
##  $ description: chr  "Decades after their 1984 All Valley Karate Tournament bout, a middle-aged Daniel LaRusso and Johnny Lawrence fi"| __truncated__ "Follows the political rivalries and romance of Queen Elizabeth II's reign and the events that shaped the second"| __truncated__ "The trials and tribulations of criminal lawyer Jimmy McGill before his fateful run-in with Walter White and Jesse Pinkman." "When a psychiatrist shelters a mysterious cult escapee, her world is turned upside down as the girl's arrival t"| __truncated__ ...
##  $ stars      : chr  "['Ralph Macchio, ', 'William Zabka, ', 'Courtney Henggeler, ', 'Xolo Maridueña']" "['Claire Foy, ', 'Olivia Colman, ', 'Imelda Staunton, ', 'Matt Smith']" "['Bob Odenkirk, ', 'Rhea Seehorn, ', 'Jonathan Banks, ', 'Patrick Fabian']" "['Emily Deschanel, ', 'Sam Jaeger, ', 'Gerardo Celasco, ', 'Madeleine Arthur']" ...
##  $ votes      : chr  "177,031" "199,885" "501,384" "9,773" ...

After seeing the structure of the data, we clean it to remove anything unnecessary for our purpose.

Load the gsubfn library (the gsub() calls used below are part of base R, so this step is optional)

library(gsubfn)
## Loading required package: proto

We need to put the data into a consistent form so we can categorize it better.

Cleaning the string columns

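# Remove thousands separators from votes, e.g. "177,031" becomes "177031"
movie$votes <- gsub("[[:punct:]]", "", movie$votes)
# Strip the letter "I" (presumably from Roman-numeral markers), punctuation, and spaces from year
movie$year <- gsub("I", "", movie$year)
movie$year <- gsub("[[:punct:]]", "", movie$year)
movie$year <- gsub(" ", "", movie$year)
# Keep the first four characters, i.e. the starting year
movie$year <- substr(movie$year, 1, 4)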

After this cleaning step the data is clearer in two ways:

  1. The year column holds the year a Netflix movie was first released. The process removes unnecessary characters and punctuation and keeps the first four characters, which are the starting year.

  2. The votes column no longer contains thousands separators, so it can be converted to numeric for analysis.
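To see what the cleaning steps above do to individual values, here is a small sketch on a few sample strings. The first three year values and the vote counts follow the forms shown in the output above; "(II) (2019)" is an assumed example of a Roman-numeral marker, and the `*_raw` and `*_clean` names are hypothetical.

```r
# Sample raw values; "\u2013" is the en dash used in the scraped year ranges
year_raw  <- c("(2018\u2013 )", "(2015\u20132022)", "(2022)", "(II) (2019)")
votes_raw <- c("177,031", "9,773", "1,831,359")

year_clean <- gsub("I", "", year_raw)               # drop the letter "I"
year_clean <- gsub("[[:punct:]]", "", year_clean)   # drop brackets and other punctuation
year_clean <- gsub(" ", "", year_clean)             # drop spaces
year_clean <- substr(year_clean, 1, 4)              # keep the first year only

# Remove the commas, then convert the text to numbers
votes_clean <- as.numeric(gsub(",", "", votes_raw))

year_clean
votes_clean
```

Because only the first four characters are kept, a series such as "(2015–2022)" is counted under its starting year, 2015.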

After cleaning, we change the data types so we can manipulate the data.

Data coercion process

movie$year <- as.factor(movie$year)
movie$certificate <- as.factor(movie$certificate)
movie$duration <- as.factor(movie$duration)
movie$genre <- as.factor(movie$genre)
movie$stars <- as.factor(movie$stars)
movie$votes <- as.numeric(movie$votes)

str(movie)
## 'data.frame':    9957 obs. of  9 variables:
##  $ title      : chr  "Cobra Kai" "The Crown" "Better Call Saul" "Devil in Ohio" ...
##  $ year       : Factor w/ 100 levels "","1932","1933",..: 80 78 77 84 84 84 75 70 84 84 ...
##  $ certificate: Factor w/ 21 levels "","12","Approved",..: 14 16 16 16 16 16 16 16 16 8 ...
##  $ duration   : Factor w/ 292 levels "","1 min","10 min",..: 168 244 220 188 134 216 128 226 216 75 ...
##  $ genre      : Factor w/ 570 levels "","Action","Action, Adventure",..: 23 191 277 427 126 407 140 288 7 194 ...
##  $ rating     : num  8.5 8.7 8.9 5.9 8.6 7.8 9.2 9.5 6.3 6.2 ...
##  $ description: chr  "Decades after their 1984 All Valley Karate Tournament bout, a middle-aged Daniel LaRusso and Johnny Lawrence fi"| __truncated__ "Follows the political rivalries and romance of Queen Elizabeth II's reign and the events that shaped the second"| __truncated__ "The trials and tribulations of criminal lawyer Jimmy McGill before his fateful run-in with Walter White and Jesse Pinkman." "When a psychiatrist shelters a mysterious cult escapee, her world is turned upside down as the girl's arrival t"| __truncated__ ...
##  $ stars      : Factor w/ 8615 levels "['A. Salaam', '| ', '    Stars:', 'Shashi Kapoor, ', 'Sulakshana Pandit, ', 'Mehmood, ', 'Sudhir']",..: 6549 1595 1034 2311 8507 8000 4203 1232 5810 492 ...
##  $ votes      : num  177031 199885 501384 9773 15413 ...
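One thing to keep in mind with factor columns: calling `as.numeric()` directly on a factor returns the internal level codes, not the original values. The sketch below, on made-up vectors with hypothetical names, shows the safe conversion and also how a duration such as "30 min" could be turned into a number of minutes if a numeric duration were ever needed (the analysis here does not use it).

```r
# A factor of years, in the same form as movie$year after coercion
year_f <- factor(c("2018", "2016", "2015", "2022"))

codes    <- as.numeric(year_f)                 # level codes, not years
year_num <- as.numeric(as.character(year_f))   # the actual years

# Optional: parse "30 min" style strings into numeric minutes
duration_raw <- c("30 min", "58 min", "356 min", "")
duration_min <- as.numeric(sub(" min", "", duration_raw, fixed = TRUE))

codes
year_num
duration_min
```

An empty duration becomes NA, which keeps missing values explicit instead of hiding them in an empty factor level.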

The new structure shows that the data types were successfully changed. For our analysis we need the data from 2011 to 2020 to see how Netflix movies performed over that period.

Filtering the data to our year range

movie_new <- movie[movie$year %in% as.character(2011:2020),]
movie_new$year <- droplevels(movie_new$year)
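The reason for `droplevels()` can be seen on a tiny made-up data frame: subsetting rows keeps every original factor level, even those with no rows left, and those empty levels would later show up as zero counts in `table()` and in plots. The names `demo` and `demo_new` are hypothetical.

```r
# Made-up data: four titles, two of them outside the 2011-2020 range
demo <- data.frame(
  title = c("A", "B", "C", "D"),
  year  = factor(c("2009", "2013", "2020", "2022"))
)

# Keep only rows whose year falls within the range
demo_new <- demo[demo$year %in% as.character(2011:2020), ]
levels_before <- nlevels(demo_new$year)   # unused levels are still attached

# Remove levels that no longer occur in the data
demo_new$year <- droplevels(demo_new$year)
levels_after <- nlevels(demo_new$year)

c(before = levels_before, after = levels_after)
```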

After filtering, the data covers 2011-2020 only. Next, we drop unnecessary columns for a simpler analysis.

Dropping unnecessary columns

movie_new <- movie_new[,-c(3:5,7:8)]
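Dropping columns by position works, but it silently selects the wrong columns if the column order ever changes. A possible alternative, sketched on a one-row made-up data frame with the same nine column names, is to keep columns by name. The names `demo` and `keep_cols` are hypothetical.

```r
# One made-up row with the same column names as the dataset
demo <- data.frame(
  title = "Cobra Kai", year = "2018", certificate = "TV-14",
  duration = "30 min", genre = "Action, Comedy, Drama", rating = 8.5,
  description = "...", stars = "...", votes = 177031
)

# Keep columns by name instead of dropping them by position
keep_cols <- c("title", "year", "rating", "votes")
by_name     <- demo[, keep_cols]
by_position <- demo[, -c(3:5, 7:8)]

# Both approaches give the same result for this column order
identical(by_name, by_position)
```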

We can compare the dimensions of the data before and after filtering and cleaning using the code below.

Data before

dim(movie)
## [1] 9957    9

Data after

dim(movie_new)
## [1] 6440    4

The result of the cleaning can be checked in more detail by looking at the top rows of the new dataset.

Top rows of the new dataset

head(movie_new)
##               title year rating   votes
## 1         Cobra Kai 2018    8.5  177031
## 2         The Crown 2016    8.7  199885
## 3  Better Call Saul 2015    8.9  501384
## 7    Rick and Morty 2013    9.2  502160
## 11  Stranger Things 2016    8.7 1149889
## 19   Peaky Blinders 2013    8.8  531058

Brief summary

summary(movie_new)
##     title                year          rating          votes        
##  Length:6440        2019   :1316   Min.   :2.000   Min.   :      5  
##  Class :character   2020   :1209   1st Qu.:6.100   1st Qu.:    260  
##  Mode  :character   2018   :1011   Median :6.900   Median :   1139  
##                     2017   : 807   Mean   :6.782   Mean   :  15302  
##                     2016   : 614   3rd Qu.:7.600   3rd Qu.:   5073  
##                     2015   : 469   Max.   :9.900   Max.   :1149902  
##                     (Other):1014   NA's   :201     NA's   :201

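From the summary we know:

  1. There are 6,440 rows for popular Netflix movies released in 2011-2020. Some titles appear more than once in the scraped data (for example, The Crown is in rows 2 and 9955 of the full dataset), so the number of distinct titles is lower.

  2. Ratings range from 2.0 to 9.9.

  3. Votes range from 5 to 1,149,902.

  4. Rating and votes are each missing (NA) for 201 rows.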
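Missing values and repeated titles can be checked before aggregating. The sketch below uses a small made-up data frame (the title "No Votes Yet" is invented) and is one possible approach, not a step the analysis above performs. Treating rows with the same title and year as duplicates is an assumption.

```r
# Made-up sample with one repeated title and one row without rating or votes
demo <- data.frame(
  title  = c("The Crown", "Cobra Kai", "The Crown", "No Votes Yet"),
  year   = c("2016", "2018", "2016", "2019"),
  rating = c(8.7, 8.5, 8.7, NA),
  votes  = c(199885, 177031, 199898, NA)
)

# Count missing values per column
na_counts <- colSums(is.na(demo))

# Keep the first occurrence of each title/year pair
demo_unique <- demo[!duplicated(demo[, c("title", "year")]), ]

# Keep only rows that have both a rating and a vote count
has_both <- complete.cases(demo_unique[, c("rating", "votes")])
demo_complete <- demo_unique[has_both, ]

na_counts
demo_complete
```

Whether duplicates should be removed depends on the question being asked. Leaving them in counts the same title more than once in totals per year.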

Data Manipulation and Transformation

We need a clearer view of how many movies Netflix released per year in 2011-2020. We will tabulate the data by year and total.

Data frame of year and total

movie_total <- as.data.frame(table(movie_new$year))
colnames(movie_total) <- c("Year", "Total")
movie_total
##    Year Total
## 1  2011   145
## 2  2012   213
## 3  2013   269
## 4  2014   387
## 5  2015   469
## 6  2016   614
## 7  2017   807
## 8  2018  1011
## 9  2019  1316
## 10 2020  1209

We can visualize this with a bar plot. The graph below shows the trend in the number of popular Netflix movies released per year.

Bar plot of total movies per year in 2011-2020

graphics::barplot(xtabs(Total ~ Year, movie_total))
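The plot above has no title or axis labels. A labelled version could look like the sketch below, which uses the first three rows of the table above. The label text and the name `movie_total_demo` are illustrative choices, and `pdf(NULL)` is there only so the snippet also runs without a screen device.

```r
# First three rows of the year/total table shown above
movie_total_demo <- data.frame(
  Year  = c("2011", "2012", "2013"),
  Total = c(145, 213, 269)
)

pdf(NULL)  # null device; remove this line and dev.off() to draw on screen

# barplot() returns the x positions of the bar midpoints
bar_mid <- barplot(movie_total_demo$Total,
                   names.arg = movie_total_demo$Year,
                   main = "Movies per release year",
                   xlab = "Year",
                   ylab = "Number of movies",
                   las = 1)

dev.off()
```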

Next, we need a clearer view of the performance of movies released in 2011-2020. Performance is measured by the total votes of the movies. We will aggregate the data by year and votes.

Data frame of year and total votes

movie_votes <- aggregate(x = votes ~ year,
          data = movie_new,
          FUN = sum)
movie_votes
##    year    votes
## 1  2011  7383244
## 2  2012  6084313
## 3  2013 11591278
## 4  2014  6985661
## 5  2015  8549435
## 6  2016 13779863
## 7  2017 12592004
## 8  2018  9325372
## 9  2019 11026520
## 10 2020  8149820
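Note that the formula interface of `aggregate()` drops rows with missing values by default (its `na.action` defaults to `na.omit`), which is why the 201 NA rows do not turn the sums into NA. A sketch on made-up numbers, with hypothetical object names, compares it with `tapply()`, where NA handling has to be requested explicitly:

```r
# Made-up votes with one missing value
demo <- data.frame(
  year  = c("2011", "2011", "2012", "2012"),
  votes = c(100, NA, 50, 25)
)

# Formula interface: the NA row is removed before summing
sum_formula <- aggregate(votes ~ year, data = demo, FUN = sum)

# tapply(): NA must be handled through na.rm
sum_tapply <- tapply(demo$votes, demo$year, sum, na.rm = TRUE)

sum_formula
sum_tapply
```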

We can visualize this with a bar plot. The graph below shows the performance of popular Netflix movies per year.

Bar plot of total votes per year in 2011-2020

graphics::barplot(xtabs(votes ~ year, movie_votes))

Next, we look at the average performance per movie for each year, measured by the average votes per movie. We will aggregate the data by year and votes.

Data frame of year and average votes

movie_votes_mean <- aggregate(x = votes ~ year,
          data = movie_new,
          FUN = mean)
movie_votes_mean
##    year     votes
## 1  2011 51631.077
## 2  2012 28835.607
## 3  2013 43413.026
## 4  2014 19244.245
## 5  2015 18425.506
## 6  2016 22776.633
## 7  2017 15878.946
## 8  2018  9400.577
## 9  2019  8567.615
## 10 2020  7315.817
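The summary earlier shows a mean of 15,302 votes against a median of 1,139, so a few very popular titles pull the mean upward. As a hedged complement to the averages above, the median per year can be computed the same way. The numbers below are made up to show the effect, and the object names are hypothetical.

```r
# Made-up votes: 2019 contains one very popular title
demo <- data.frame(
  year  = c("2019", "2019", "2019", "2020", "2020", "2020"),
  votes = c(100, 200, 90000, 150, 250, 350)
)

votes_mean   <- aggregate(votes ~ year, data = demo, FUN = mean)
votes_median <- aggregate(votes ~ year, data = demo, FUN = median)

# Side-by-side comparison of the two summaries
merge(votes_mean, votes_median, by = "year",
      suffixes = c("_mean", "_median"))
```

In this toy data a single title lifts the 2019 mean far above the median, while for 2020 the two agree.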

We can visualize this with a bar plot. The graph below shows the average performance of popular Netflix movies per year.

Bar plot of average votes per movie per year in 2011-2020

graphics::barplot(xtabs(votes ~ year, movie_votes_mean))

Finally, we need a clearer view of the average rating of popular Netflix movies. We will aggregate the data by year and rating.

Data frame of year and rating

movie_rating <- aggregate(x = rating ~ year,
          data = movie_new,
          FUN = mean)
movie_rating
##    year   rating
## 1  2011 6.581818
## 2  2012 6.810427
## 3  2013 6.585768
## 4  2014 6.722590
## 5  2015 6.762931
## 6  2016 6.723802
## 7  2017 6.731021
## 8  2018 6.811895
## 9  2019 6.965113
## 10 2020 6.703591

We can visualize this with a bar plot. The graph below shows the average rating of popular Netflix movies per year.

Bar plot of average rating per year in 2011-2020

graphics::barplot(xtabs(rating ~ year, movie_rating))

  1. Which year has the most movies released in 2011-2020?
head(movie_total[order(movie_total$Total, decreasing = TRUE),],1)
##   Year Total
## 9 2019  1316
  2. Which year has the fewest movies released in 2011-2020?
tail(movie_total[order(movie_total$Total, decreasing = TRUE),],1)
##   Year Total
## 1 2011   145
  3. Which movie has the most votes in 2011-2020?
head(movie_new[order(movie_new$votes, decreasing = TRUE),],1)
##                title year rating   votes
## 9949 Stranger Things 2016    8.7 1149902
  4. Which movie has the highest rating in 2011-2020?
head(movie_new[order(movie_new$rating, decreasing = TRUE),],1)
##                title year rating votes
## 9445 BoJack Horseman 2014    9.9 16066
  5. Which release year has the highest average rating in 2011-2020?
head(movie_rating[order(movie_rating$rating, decreasing = TRUE),],1)
##   year   rating
## 9 2019 6.965113
  6. Which release year has the lowest average rating in 2011-2020?
tail(movie_rating[order(movie_rating$rating, decreasing = TRUE),],1)
##   year   rating
## 1 2011 6.581818
  7. Which release year has the highest total votes in 2011-2020?
head(movie_votes[order(movie_votes$votes, decreasing = TRUE),],1)
##   year    votes
## 6 2016 13779863
  8. Which release year has the lowest total votes in 2011-2020?
tail(movie_votes[order(movie_votes$votes, decreasing = TRUE),],1)
##   year   votes
## 2 2012 6084313
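The same questions can also be answered without sorting the whole data frame, using `which.max()` and `which.min()`, which return the position of the first maximum or minimum and skip NA values. A minimal sketch on the last three rows of the year/total table, with hypothetical object names:

```r
# Last three rows of the year/total table shown earlier
movie_total_demo <- data.frame(
  Year  = c("2018", "2019", "2020"),
  Total = c(1011, 1316, 1209)
)

# Row with the largest and the smallest total
most  <- movie_total_demo[which.max(movie_total_demo$Total), ]
least <- movie_total_demo[which.min(movie_total_demo$Total), ]

most
least
```

Skipping NA matters for the lowest-value case in particular: `order()` places NA values last by default, so `tail()` on a sorted column that contains NA would return an NA row rather than the true minimum.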

Explanatory Text and Business Recommendation

Explanatory Text

The data manipulation and transformation above yield several insights about popular Netflix movies:

1. The number of popular movies Netflix released increased every year from 2011 to 2019, with a slight dip in 2020.

2. Releasing more movies does not mean total votes increase: total votes per year fluctuate between about 6 million and 14 million with no clear upward trend. This suggests that the audience for Netflix movies may already have reached its peak.

3. The average votes per movie tend to decrease, from about 51,600 in 2011 to about 7,300 in 2020, so more movies does not translate into more votes overall. Votes are spread across many more movies, and performance per movie is falling.

4. The movie with the most votes does not have the highest rating, and the movie with the highest rating does not have the most votes. We therefore weigh votes more heavily when measuring a movie's performance. Further evidence is that the average rating per year barely changes (between about 6.6 and 7.0), while total votes per year are far more volatile. Rating does not reflect the performance of a Netflix movie very well.
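The relationship between rating and votes described in the last point could be quantified with a rank correlation. The sketch below only illustrates the call: it uses four rating/votes pairs taken from rows shown earlier plus one made-up NA row, which is far too few to support any conclusion, and the choice of Spearman's method is an assumption.

```r
# Four rating/votes pairs from rows shown earlier, plus one NA row
demo <- data.frame(
  rating = c(8.7, 9.9, 6.3, 7.8, NA),
  votes  = c(1149902, 16066, 3130, 116358, NA)
)

# Spearman rank correlation, ignoring rows with missing values
rho <- cor(demo$rating, demo$votes,
           method = "spearman", use = "complete.obs")
rho
```

Running the same call on `movie_new$rating` and `movie_new$votes` would show how strongly the two measures move together across the full 2011-2020 data.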

Recommendation

From the insights above we can give some business recommendations based on the data:

1. If Netflix keeps increasing the number of movies it releases, it needs contracts that pay based on a movie's views or votes. That means lowering the upfront payment and allocating more to payment per view or vote.

2. Netflix should work out a formula for making movies that attract more votes. That means focusing on votes per movie or views per movie. Raising the quality of the movies will help increase votes.

3. If the market for Netflix movies is at its peak, Netflix should consider a strategy to diversify its brand or products so that the company can perform better.