This dataset describes titles (movies and series) available on Netflix. For each title it records the name, release year, certificate, duration, genre, IMDb rating, description, main cast (stars), and number of IMDb votes. The data was collected from the IMDb website by web scraping.
Make sure the data file is placed in the same folder as the R project.
movie <- read.csv("n_movies.csv")
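A minimal sketch of a guard around the read step, assuming the file name `n_movies.csv` from the text and that the working directory is the project folder. The `csv_path` variable and the message are illustrative additions, not part of the original analysis.

```r
# File name taken from the text; assumed to sit in the working directory.
csv_path <- "n_movies.csv"

if (file.exists(csv_path)) {
  movie <- read.csv(csv_path)
} else {
  # Show where R is looking, so the file can be moved there.
  message("Could not find ", csv_path, " in ", getwd())
}
```

If the message appears, either move the file into the reported folder or change the working directory before reading.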
After reading the data, we sample a few rows to check that it loaded properly.
Check the top rows
head(movie)
## title year certificate duration
## 1 Cobra Kai (2018– ) TV-14 30 min
## 2 The Crown (2016– ) TV-MA 58 min
## 3 Better Call Saul (2015–2022) TV-MA 46 min
## 4 Devil in Ohio (2022) TV-MA 356 min
## 5 Cyberpunk: Edgerunners (2022– ) TV-MA 24 min
## 6 The Sandman (2022– ) TV-MA 45 min
## genre rating
## 1 Action, Comedy, Drama 8.5
## 2 Biography, Drama, History 8.7
## 3 Crime, Drama 8.9
## 4 Drama, Horror, Mystery 5.9
## 5 Animation, Action, Adventure 8.6
## 6 Drama, Fantasy, Horror 7.8
## description
## 1 Decades after their 1984 All Valley Karate Tournament bout, a middle-aged Daniel LaRusso and Johnny Lawrence find themselves martial-arts rivals again.
## 2 Follows the political rivalries and romance of Queen Elizabeth II's reign and the events that shaped the second half of the twentieth century.
## 3 The trials and tribulations of criminal lawyer Jimmy McGill before his fateful run-in with Walter White and Jesse Pinkman.
## 4 When a psychiatrist shelters a mysterious cult escapee, her world is turned upside down as the girl's arrival threatens to tear her own family apart.
## 5 A Street Kid trying to survive in a technology and body modification-obsessed city of the future. Having everything to lose, he chooses to stay alive by becoming an Edgerunner, a Mercenary outlaw also known as a Cyberpunk.
## 6 Upon escaping after decades of imprisonment by a mortal wizard, Dream, the personification of dreams, sets about to reclaim his lost equipment.
## stars
## 1 ['Ralph Macchio, ', 'William Zabka, ', 'Courtney Henggeler, ', 'Xolo Maridueña']
## 2 ['Claire Foy, ', 'Olivia Colman, ', 'Imelda Staunton, ', 'Matt Smith']
## 3 ['Bob Odenkirk, ', 'Rhea Seehorn, ', 'Jonathan Banks, ', 'Patrick Fabian']
## 4 ['Emily Deschanel, ', 'Sam Jaeger, ', 'Gerardo Celasco, ', 'Madeleine Arthur']
## 5 ['Zach Aguilar, ', 'Kenichiro Ohashi, ', 'Emi Lo, ', 'Aoi Yûki']
## 6 ['Tom Sturridge, ', 'Boyd Holbrook, ', 'Patton Oswalt, ', 'Vivienne Acheampong']
## votes
## 1 177,031
## 2 199,885
## 3 501,384
## 4 9,773
## 5 15,413
## 6 116,358
Check the bottom rows
tail(movie)
## title year certificate duration
## 9952 Breaking Bad (2008–2013) TV-MA 49 min
## 9953 The Imperfects (2022– ) TV-MA 45 min
## 9954 The Walking Dead (2010–2022) TV-MA 44 min
## 9955 The Crown (2016– ) TV-MA 58 min
## 9956 Supernatural (2005–2020) TV-14 44 min
## 9957 Devil in Ohio (2022) TV-MA 356 min
## genre rating
## 9952 Crime, Drama, Thriller 9.5
## 9953 Action, Adventure, Drama 6.3
## 9954 Drama, Horror, Thriller 8.1
## 9955 Biography, Drama, History 8.7
## 9956 Drama, Fantasy, Horror 8.4
## 9957 Drama, Horror, Mystery 5.9
## description
## 9952 A high school chemistry teacher diagnosed with inoperable lung cancer turns to manufacturing and selling methamphetamine in order to secure his family's future.
## 9953 After an experimental gene therapy turns them into monsters, three twenty-somethings band together to hunt down the scientist responsible and force him to make them human again.
## 9954 Sheriff Deputy Rick Grimes wakes up from a coma to learn the world is in ruins and must lead a group of survivors to stay alive.
## 9955 Follows the political rivalries and romance of Queen Elizabeth II's reign and the events that shaped the second half of the twentieth century.
## 9956 Two brothers follow their father's footsteps as hunters, fighting evil supernatural beings of many kinds, including monsters, demons and gods that roam the earth.
## 9957 When a psychiatrist shelters a mysterious cult escapee, her world is turned upside down as the girl's arrival threatens to tear her own family apart.
## stars
## 9952 ['Bryan Cranston, ', 'Aaron Paul, ', 'Anna Gunn, ', 'Betsy Brandt']
## 9953 ['Morgan Taylor Campbell, ', 'Italia Ricci, ', 'Rhianna Jagpal, ', 'Iñaki Godoy']
## 9954 ['Andrew Lincoln, ', 'Norman Reedus, ', 'Melissa McBride, ', 'Lauren Cohan']
## 9955 ['Claire Foy, ', 'Olivia Colman, ', 'Imelda Staunton, ', 'Matt Smith']
## 9956 ['Jared Padalecki, ', 'Jensen Ackles, ', 'Jim Beaver, ', 'Misha Collins']
## 9957 ['Emily Deschanel, ', 'Sam Jaeger, ', 'Gerardo Celasco, ', 'Madeleine Arthur']
## votes
## 9952 1,831,359
## 9953 3,130
## 9954 970,067
## 9955 199,898
## 9956 439,601
## 9957 9,786
Check the data dimensions
dim(movie)
## [1] 9957 9
Check the list of column names
names(movie)
## [1] "title" "year" "certificate" "duration" "genre"
## [6] "rating" "description" "stars" "votes"
Next we inspect the structure of the data (column types and sample values) so we know what needs cleaning.
Data structure
str(movie)
## 'data.frame': 9957 obs. of 9 variables:
## $ title : chr "Cobra Kai" "The Crown" "Better Call Saul" "Devil in Ohio" ...
## $ year : chr "(2018– )" "(2016– )" "(2015–2022)" "(2022)" ...
## $ certificate: chr "TV-14" "TV-MA" "TV-MA" "TV-MA" ...
## $ duration : chr "30 min" "58 min" "46 min" "356 min" ...
## $ genre : chr "Action, Comedy, Drama" "Biography, Drama, History" "Crime, Drama" "Drama, Horror, Mystery" ...
## $ rating : num 8.5 8.7 8.9 5.9 8.6 7.8 9.2 9.5 6.3 6.2 ...
## $ description: chr "Decades after their 1984 All Valley Karate Tournament bout, a middle-aged Daniel LaRusso and Johnny Lawrence fi"| __truncated__ "Follows the political rivalries and romance of Queen Elizabeth II's reign and the events that shaped the second"| __truncated__ "The trials and tribulations of criminal lawyer Jimmy McGill before his fateful run-in with Walter White and Jesse Pinkman." "When a psychiatrist shelters a mysterious cult escapee, her world is turned upside down as the girl's arrival t"| __truncated__ ...
## $ stars : chr "['Ralph Macchio, ', 'William Zabka, ', 'Courtney Henggeler, ', 'Xolo Maridueña']" "['Claire Foy, ', 'Olivia Colman, ', 'Imelda Staunton, ', 'Matt Smith']" "['Bob Odenkirk, ', 'Rhea Seehorn, ', 'Jonathan Banks, ', 'Patrick Fabian']" "['Emily Deschanel, ', 'Sam Jaeger, ', 'Gerardo Celasco, ', 'Madeleine Arthur']" ...
## $ votes : chr "177,031" "199,885" "501,384" "9,773" ...
After seeing the structure of the data, we need to clean it and remove the parts that are unnecessary for our purpose.
Load the gsubfn library (gsub() and substr(), used below, are base R functions)
library(gsubfn)
## Loading required package: proto
We need to bring the values into a consistent form so we can categorize the data better.
Cleaning the data with string substitution
movie$votes <- gsub("[[:punct:]]", "", movie$votes)
movie$year <- gsub("I", "", movie$year)
movie$year <- gsub("[[:punct:]]", "", movie$year)
movie$year <- gsub(" ", "", movie$year)
movie$year <- substr(movie$year, 1, 4)
From this cleaning step we get cleaner data about:
When a Netflix movie was released (the year column). The process removes unnecessary characters and punctuation and keeps the first four characters of each value, which are the starting year.
How many votes a movie received (the votes column). Removing the thousands separators lets us convert votes to numeric later so it can be analyzed.
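The cleaning steps above can be traced on a few toy values. This is a sketch, not the original data: the year strings mimic the formats shown by head(movie), and the "(I) (2019)" form is an assumption about why the code strips the letter "I".

```r
# Toy values mimicking the raw "year" and "votes" formats.
# "\u2013" is the en dash that appears in year ranges.
year_raw  <- c("(2018\u2013 )", "(2015\u20132022)", "(I) (2019)", "(2022)")
votes_raw <- c("177,031", "9,773", "1,831,359")

year_clean <- gsub("I", "", year_raw)              # drop Roman-numeral markers (assumed format)
year_clean <- gsub("[[:punct:]]", "", year_clean)  # drop brackets and other punctuation
year_clean <- gsub(" ", "", year_clean)            # drop spaces
year_clean <- substr(year_clean, 1, 4)             # keep the first four characters

# Remove thousands separators, then convert to numeric.
votes_clean <- as.numeric(gsub("[[:punct:]]", "", votes_raw))

year_clean
votes_clean
```

Because only the first four characters are kept, a range such as 2015 to 2022 is reduced to its starting year.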
After cleaning the strings, we need to change the data types so we can manipulate the data.
Data coercion process
movie$year <- as.factor(movie$year)
movie$certificate <- as.factor(movie$certificate)
movie$duration <- as.factor(movie$duration)
movie$genre <- as.factor(movie$genre)
movie$stars <- as.factor(movie$stars)
movie$votes <- as.numeric(movie$votes)
str(movie)
## 'data.frame': 9957 obs. of 9 variables:
## $ title : chr "Cobra Kai" "The Crown" "Better Call Saul" "Devil in Ohio" ...
## $ year : Factor w/ 100 levels "","1932","1933",..: 80 78 77 84 84 84 75 70 84 84 ...
## $ certificate: Factor w/ 21 levels "","12","Approved",..: 14 16 16 16 16 16 16 16 16 8 ...
## $ duration : Factor w/ 292 levels "","1 min","10 min",..: 168 244 220 188 134 216 128 226 216 75 ...
## $ genre : Factor w/ 570 levels "","Action","Action, Adventure",..: 23 191 277 427 126 407 140 288 7 194 ...
## $ rating : num 8.5 8.7 8.9 5.9 8.6 7.8 9.2 9.5 6.3 6.2 ...
## $ description: chr "Decades after their 1984 All Valley Karate Tournament bout, a middle-aged Daniel LaRusso and Johnny Lawrence fi"| __truncated__ "Follows the political rivalries and romance of Queen Elizabeth II's reign and the events that shaped the second"| __truncated__ "The trials and tribulations of criminal lawyer Jimmy McGill before his fateful run-in with Walter White and Jesse Pinkman." "When a psychiatrist shelters a mysterious cult escapee, her world is turned upside down as the girl's arrival t"| __truncated__ ...
## $ stars : Factor w/ 8615 levels "['A. Salaam', '| ', ' Stars:', 'Shashi Kapoor, ', 'Sulakshana Pandit, ', 'Mehmood, ', 'Sudhir']",..: 6549 1595 1034 2311 8507 8000 4203 1232 5810 492 ...
## $ votes : num 177031 199885 501384 9773 15413 ...
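A short sketch of two coercion details that are easy to trip over. It assumes one might later want duration as a number of minutes and year as a number rather than a factor; neither step is part of the original analysis, and the toy vectors are made up to match the formats shown above.

```r
# Duration strings in the format shown by str(movie); "" stands for a missing value.
duration_raw <- c("30 min", "58 min", "356 min", "")
duration_min <- as.numeric(sub(" min", "", duration_raw, fixed = TRUE))
duration_min  # the empty string becomes NA

# Converting a factor directly to numeric returns the internal level codes,
# not the labels, so go through as.character() first.
year_factor <- factor(c("2018", "2016", "2015"))
year_codes  <- as.numeric(year_factor)                 # level codes
year_values <- as.numeric(as.character(year_factor))   # the actual years
```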
From the new data structure we can see that the data types were successfully changed. For our analysis we need the data from 2011 to 2020 to see how Netflix movies performed over that range of years.
Filtering the data to our year range
movie_new <- movie[movie$year %in% c("2011","2012","2013","2014","2015","2016","2017","2018","2019","2020"),]
movie_new$year <- droplevels(movie_new$year)
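The filter-then-droplevels pattern can be seen on a toy data frame. This is a sketch with made-up rows; `as.character(2011:2020)` is used here as a shorter way to write the same ten year labels.

```r
toy <- data.frame(
  title = c("A", "B", "C", "D"),
  year  = factor(c("2009", "2011", "2020", "2022"))
)

wanted  <- as.character(2011:2020)     # "2011", "2012", ..., "2020"
toy_new <- toy[toy$year %in% wanted, ]

levels(toy_new$year)                   # still lists all four original levels
toy_new$year <- droplevels(toy_new$year)
levels(toy_new$year)                   # only the levels that still occur
```

Without droplevels(), later tables and plots would still show empty categories for the years that were filtered out.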
After filtering, we have data from the years 2011-2020 only. Next we remove the columns we do not need, for a simpler analysis.
Dropping unnecessary columns
movie_new <- movie_new[,-c(3:5,7:8)]
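Dropping columns by position works only while the column order stays the same. As a sketch of an alternative, the same four columns can be kept by name. The one-row data frame below is a stand-in with placeholder text in the description and stars columns.

```r
toy <- data.frame(
  title = "Cobra Kai", year = "2018", certificate = "TV-14",
  duration = "30 min", genre = "Action, Comedy, Drama", rating = 8.5,
  description = "placeholder", stars = "placeholder", votes = 177031
)

by_position <- toy[, -c(3:5, 7:8)]               # as in the text
keep        <- c("title", "year", "rating", "votes")
by_name     <- toy[, keep]                       # same columns, selected by name

identical(by_position, by_name)
```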
We can compare the dimensions of our data before and after subsetting and cleaning by using the code below.
Data before
dim(movie)
## [1] 9957 9
Data after
dim(movie_new)
## [1] 6440 4
The result of the data cleaning can be checked in more detail by looking at the top rows and the columns of the new data.
Top rows of the new dataset
head(movie_new)
## title year rating votes
## 1 Cobra Kai 2018 8.5 177031
## 2 The Crown 2016 8.7 199885
## 3 Better Call Saul 2015 8.9 501384
## 7 Rick and Morty 2013 9.2 502160
## 11 Stranger Things 2016 8.7 1149889
## 19 Peaky Blinders 2013 8.8 531058
Brief summary
summary(movie_new)
## title year rating votes
## Length:6440 2019 :1316 Min. :2.000 Min. : 5
## Class :character 2020 :1209 1st Qu.:6.100 1st Qu.: 260
## Mode :character 2018 :1011 Median :6.900 Median : 1139
## 2017 : 807 Mean :6.782 Mean : 15302
## 2016 : 614 3rd Qu.:7.600 3rd Qu.: 5073
## 2015 : 469 Max. :9.900 Max. :1149902
## (Other):1014 NA's :201 NA's :201
From our data we know:
1. There are 6,440 popular Netflix movies from the years 2011-2020.
2. Ratings range from 2.0 to 9.9.
3. Votes range from 5 to 1,149,902.
4. Rating and votes are each missing (NA) for 201 movies.
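The summary reports missing values in rating and votes. A minimal sketch on made-up rows of how to count them and how they affect later steps: the formula interface of aggregate() drops rows with NA by default (its na.action argument defaults to na.omit), while range() needs na.rm = TRUE.

```r
toy <- data.frame(
  year   = factor(c("2011", "2011", "2012")),
  rating = c(8.5, NA, 6.0),
  votes  = c(100, NA, 50)
)

na_counts    <- colSums(is.na(toy[, c("rating", "votes")]))  # NA count per column
rating_range <- range(toy$rating, na.rm = TRUE)              # ignore NA explicitly
votes_by_yr  <- aggregate(votes ~ year, data = toy, FUN = sum)  # NA rows dropped

na_counts
rating_range
votes_by_yr
```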
We need a clearer view of how many movies Netflix published per year in 2011-2020. We will tabulate the data by year and total.
Data frame of year and total
movie_total <- as.data.frame(table(movie_new$year))
colnames(movie_total) <- c("Year", "Total")
movie_total
## Year Total
## 1 2011 145
## 2 2012 213
## 3 2013 269
## 4 2014 387
## 5 2015 469
## 6 2016 614
## 7 2017 807
## 8 2018 1011
## 9 2019 1316
## 10 2020 1209
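As a sketch of a slightly shorter route to the same kind of table, the column names can be set while converting the table, using the dimension name in table() and the responseName argument of as.data.frame(). The `years` vector is made up for illustration.

```r
years <- factor(c("2011", "2012", "2012", "2013", "2013", "2013"))

# Name the dimension "Year" and the count column "Total" in one step.
totals <- as.data.frame(table(Year = years), responseName = "Total")
totals
```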
We can visualize this with a bar plot. The graph below shows the trend in the number of popular Netflix movies published per year.
Bar plot of total movies per year in 2011-2020
graphics::barplot(xtabs(Total ~ Year, movie_total))
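A hedged sketch of the same plot with a title and axis labels added, so the figure can be read on its own. The counts are copied from the table above; the label wording is an assumption.

```r
# Counts per year, copied from the movie_total table in the text.
movie_total <- data.frame(
  Year  = as.character(2011:2020),
  Total = c(145, 213, 269, 387, 469, 614, 807, 1011, 1316, 1209)
)

# barplot() returns the x positions of the bar midpoints.
bar_mids <- barplot(
  height    = movie_total$Total,
  names.arg = movie_total$Year,
  main      = "Titles per release year, 2011-2020",
  xlab      = "Year",
  ylab      = "Number of titles"
)
```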
We need a clearer view of the performance of movies produced in 2011-2020. Performance is measured here by the total votes of the movies. We will aggregate the data by year and votes.
Data frame of year and votes
movie_votes <- aggregate(x = votes ~ year,
data = movie_new,
FUN = sum)
movie_votes
## year votes
## 1 2011 7383244
## 2 2012 6084313
## 3 2013 11591278
## 4 2014 6985661
## 5 2015 8549435
## 6 2016 13779863
## 7 2017 12592004
## 8 2018 9325372
## 9 2019 11026520
## 10 2020 8149820
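To make the ups and downs in this table explicit, the year-over-year change can be computed with diff(). A small sketch using the totals copied from the output above:

```r
# Total votes per year, copied from the movie_votes table in the text.
votes_total <- c(7383244, 6084313, 11591278, 6985661, 8549435,
                 13779863, 12592004, 9325372, 11026520, 8149820)
names(votes_total) <- 2011:2020

change <- diff(votes_total)   # each year minus the previous year
change
sum(change > 0)               # number of years with an increase
sum(change < 0)               # number of years with a decrease
```

Total votes rise in four of the nine year-to-year steps and fall in five, so they do not follow the steady growth in the number of titles.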
We can visualize this with a bar plot. The graph below shows the performance of popular Netflix movies per year.
Bar plot of movie performance (total votes) per year in 2011-2020
graphics::barplot(xtabs(votes ~ year, movie_votes))
We also need a clearer view of the average performance per movie. Performance is measured here by the average number of votes per movie in each year. We will aggregate the data by year and average votes.
Data frame of year and average votes
movie_votes_mean <- aggregate(x = votes ~ year,
data = movie_new,
FUN = mean)
movie_votes_mean
## year votes
## 1 2011 51631.077
## 2 2012 28835.607
## 3 2013 43413.026
## 4 2014 19244.245
## 5 2015 18425.506
## 6 2016 22776.633
## 7 2017 15878.946
## 8 2018 9400.577
## 9 2019 8567.615
## 10 2020 7315.817
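The averages here are taken over movies that have a vote count, because the formula interface of aggregate() drops NA rows by default. As a worked check using only the numbers printed above, dividing each yearly total by the yearly mean recovers how many movies entered each average:

```r
# Values copied from the movie_votes, movie_votes_mean and movie_total tables.
votes_total <- c(7383244, 6084313, 11591278, 6985661, 8549435,
                 13779863, 12592004, 9325372, 11026520, 8149820)
votes_mean  <- c(51631.077, 28835.607, 43413.026, 19244.245, 18425.506,
                 22776.633, 15878.946, 9400.577, 8567.615, 7315.817)
titles      <- c(145, 213, 269, 387, 469, 614, 807, 1011, 1316, 1209)

implied_n <- round(votes_total / votes_mean)  # movies with non-missing votes
missing_n <- titles - implied_n               # movies without a vote count

data.frame(year = 2011:2020, titles, implied_n, missing_n)
sum(missing_n)
```

The implied counts add up to 6,239, which is 201 fewer than the 6,440 rows and matches the NA count reported by summary(movie_new).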
We can visualize this with a bar plot. The graph below shows the average performance of popular Netflix movies per year.
Bar plot of average votes per movie per year in 2011-2020
graphics::barplot(xtabs(votes ~ year, movie_votes_mean))
We need a clearer view of the average rating of popular Netflix movies. We will aggregate the data by year and rating.
Data frame of year and average rating
movie_rating <- aggregate(x = rating ~ year,
data = movie_new,
FUN = mean)
movie_rating
## year rating
## 1 2011 6.581818
## 2 2012 6.810427
## 3 2013 6.585768
## 4 2014 6.722590
## 5 2015 6.762931
## 6 2016 6.723802
## 7 2017 6.731021
## 8 2018 6.811895
## 9 2019 6.965113
## 10 2020 6.703591
We can visualize this with a bar plot. The graph below shows the average rating of popular Netflix movies per year.
Bar plot of average rating per year in 2011-2020
graphics::barplot(xtabs(rating ~ year, movie_rating))
head(movie_total[order(movie_total$Total, decreasing = TRUE), ], 1)
## Year Total
## 9 2019 1316
tail(movie_total[order(movie_total$Total, decreasing = TRUE), ], 1)
## Year Total
## 1 2011 145
head(movie_new[order(movie_new$votes, decreasing = TRUE), ], 1)
## title year rating votes
## 9949 Stranger Things 2016 8.7 1149902
head(movie_new[order(movie_new$rating, decreasing = TRUE), ], 1)
## title year rating votes
## 9445 BoJack Horseman 2014 9.9 16066
head(movie_rating[order(movie_rating$rating, decreasing = TRUE), ], 1)
## year rating
## 9 2019 6.965113
tail(movie_rating[order(movie_rating$rating, decreasing = TRUE), ], 1)
## year rating
## 1 2011 6.581818
head(movie_votes[order(movie_votes$votes, decreasing = TRUE), ], 1)
## year votes
## 6 2016 13779863
tail(movie_votes[order(movie_votes$votes, decreasing = TRUE), ], 1)
## year votes
## 2 2012 6084313
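Sorting the whole data frame just to read off one row works, but which.max() and which.min() give the same rows more directly. A sketch using the yearly totals copied from the text:

```r
# Counts per year, copied from the movie_total table in the text.
movie_total <- data.frame(
  Year  = as.character(2011:2020),
  Total = c(145, 213, 269, 387, 469, 614, 807, 1011, 1316, 1209)
)

most  <- movie_total[which.max(movie_total$Total), ]  # row with the largest total
fewest <- movie_total[which.min(movie_total$Total), ] # row with the smallest total

most
fewest
```

Both functions ignore NA values and return the first position when there are ties.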
Explanatory Text
From the data manipulation and transformation above, we gain some insights that lead to conclusions about popular Netflix movies. The insights are:
1. The number of popular movies Netflix publishes tends to increase every year.
2. Publishing more movies does not mean total votes increase. Total votes per year rise and fall while the number of movies keeps growing, which suggests the market for Netflix movies may already be at its peak.
3. The average number of votes per movie tends to decrease, so more movies do not bring more votes overall. Votes are spread across many more movies, and the performance per movie is falling.
4. The most-voted movie does not have the highest rating, and the highest-rated movie does not have the most votes. We therefore weigh votes more heavily when judging a movie's performance. Further evidence is that the average rating per year barely changes while the total votes per year are much more volatile, so rating does not reflect the performance of Netflix movies well.
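The claim that yearly ratings barely move while yearly votes are volatile can be checked with a simple relative measure. This sketch uses the coefficient of variation (standard deviation divided by mean), which is one possible choice and not a measure used in the original analysis. The values are copied from the yearly tables above.

```r
# Yearly average rating and yearly total votes, copied from the text.
rating_mean <- c(6.581818, 6.810427, 6.585768, 6.722590, 6.762931,
                 6.723802, 6.731021, 6.811895, 6.965113, 6.703591)
votes_total <- c(7383244, 6084313, 11591278, 6985661, 8549435,
                 13779863, 12592004, 9325372, 11026520, 8149820)

# Coefficient of variation: spread relative to the mean, unit-free.
cv <- function(x) sd(x) / mean(x)

cv_rating <- cv(rating_mean)
cv_votes  <- cv(votes_total)

c(rating = cv_rating, votes = cv_votes)
```

A much larger value for votes than for rating is consistent with the fourth insight.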
Recommendation
From the insights above we can give some business recommendations based on the data. The recommendations are:
1. If Netflix keeps increasing the number of movies it publishes, it needs contracts that pay based on the views or votes of each movie. This means lowering the upfront payment and allocating more to payment per view or vote.
2. Netflix should work out what makes a movie attract more votes. This means focusing on votes per movie or views per movie. Improving the quality of the movies should help increase votes.
3. If the market for Netflix movies is at its peak, Netflix should develop a strategy to diversify its brand or products so that it can perform better as a company.