As part of this project, I will analyze the Netflix TV show and movie data and try to answer the research questions below. The dataset consists of TV shows and movies available on Netflix as of 2021. It was collected from Flixable, a third-party Netflix search engine.
As part of this project, we perform the following analysis on the Netflix content (movies and TV shows).
There are 7787 observations in the dataset, each of which is either a movie or a TV show, and the 12 variables below describe the data:
show_id - Unique ID for Movie / TV Show
type - Identifier - A Movie or TV Show
title - Title of the Movie / TV Show
director - Director of the Movie
cast - Actors involved in the movie / show
country - Country where the movie / show was produced
date_added - Date it was added to Netflix
release_year - Actual release year of the movie / show
rating - TV Rating of the movie / show
duration - Total Duration - in minutes or number of seasons
listed_in - Genres
description - The summary description
This is an observational study.
The data was collected from the website below:
Perform a regression analysis of movie duration on release year, and of the number of seasons of TV shows on release year. Release year is the independent variable and duration (or number of seasons) is the dependent variable.
library(tidyverse)
library(dplyr)
library(nortest)
data <- read.csv("https://raw.githubusercontent.com/rathish-ps/Data607-Assignment/main/netflix_titles.csv")
# Treat empty country strings as missing values
data <- data %>%
  mutate(country = replace(country, country == "", NA))
# Keep a copy without the rows whose country is missing
data_without_countryNA <- data %>% drop_na(country)
data$director <- iconv(data$director, 'utf-8', 'ASCII//TRANSLIT', sub='')
data$title <- iconv(data$title, 'utf-8', 'ASCII//TRANSLIT', sub='')
data$cast <- iconv(data$cast, 'utf-8', 'ASCII//TRANSLIT', sub='')
data$description <- iconv(data$description, 'utf-8', 'ASCII//TRANSLIT', sub='')
#data$date_added <- parse_date_time(data$date_added, orders = "mdy")
data$date_added <-as.Date(strptime(data$date_added, format = "%d-%b-%y"))
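The format string above assumes dates written like `14-Aug-20` (day, abbreviated month name, two-digit year). The following is a minimal sketch with made-up strings, not values taken from the dataset, and it assumes an English locale for the month abbreviations:

```r
# Hypothetical sample values in the assumed "%d-%b-%y" layout
raw_dates <- c("14-Aug-20", "01-Jan-19", "not a date")

# strptime() returns NA for strings that do not match the format
parsed <- as.Date(strptime(raw_dates, format = "%d-%b-%y"))

# Extract the year the same way the later plots do
year_added <- as.numeric(format(parsed, "%Y"))
print(year_added)
```

Rows that fail to parse become NA, which is why the later chunks filter on `!is.na(date_added)` before grouping by year.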
data1<- data %>%
group_by(type) %>%
summarise(Total = n()) %>%
mutate( Percentage =Total/sum(Total)*100 )
ggplot(data=data1)+
geom_bar(aes(x="", y=Total, fill=type), stat="identity", width = 1)+
coord_polar("y", start=0)+
ggtitle("TV Shows vs Movies") +
theme_void()+
geom_text(aes(x=1, y = Total, label=sprintf("%0.1f%s", round(Percentage, digits = 3),"%")))
From the graph we can see that 69.1% of the content on Netflix is movies and 30.9% is TV shows.
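The split can be cross-checked by hand. The counts used here are not printed directly in this report; they are inferred from the residual degrees of freedom in the regression outputs further down (n = df + 2 for a simple linear regression), so treat them as a derived assumption rather than a quoted figure.

```r
# Counts inferred from the regression outputs: 5375 + 2 and 2408 + 2
counts <- c(Movie = 5377, `TV Show` = 2410)

# Share of each content type, in percent, rounded to one decimal
share <- round(100 * counts / sum(counts), 1)

print(sum(counts))  # should match the number of observations in the dataset
print(share)
```

The two counts add up to the 7787 observations stated at the top, which supports the inference.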
data %>% filter(as.numeric(format(date_added,'%Y'))<2021) %>%
group_by(year=as.numeric(format(date_added,'%Y')), type) %>%
count()%>%
ggplot() + geom_line(aes(x = year, y = n, group = type, color = type)) +
ggtitle("Movies and TV Shows added per year") +
labs(x = "Year", y = "Count")
From the graph we can see that from 2015 onwards Netflix started adding more TV shows and movies.
data_show <- data %>% filter(type == 'TV Show')
# First pass: strip the plural suffix ("2 Seasons" -> 2); "1 Season" becomes NA here
data_show1 <- data_show %>% mutate(durationNum = as.numeric(strsplit(duration, "Seasons")))
# Second pass: handle the singular form ("1 Season" -> 1)
data_show2 <- data_show1 %>% filter(is.na(durationNum)) %>% mutate(durationNum = as.numeric(strsplit(duration, "Season")))
data_show1 <- data_show1 %>% filter(!is.na(durationNum))
data_show <- rbind(data_show1, data_show2)
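The `duration` column mixes a number with a unit, such as a count of seasons or a length in minutes. As an alternative illustration only, here is a minimal base R sketch that extracts the leading number with a regular expression in a single pass. The sample strings are hypothetical and this is not the approach used in the analysis.

```r
# Hypothetical sample of the `duration` column
duration <- c("1 Season", "2 Seasons", "10 Seasons", "90 min")

# Keep only the leading run of digits, then convert to numeric
durationNum <- as.numeric(sub("^([0-9]+).*$", "\\1", duration))
print(durationNum)
```

Because the pattern ignores everything after the digits, the same expression covers the singular form, the plural form, and the minutes suffix.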
# find the tv shows by country
data4 <- data_show %>%filter(!is.na(country)) %>%
mutate(country = strsplit(as.character(country), ", ")) %>%
unnest(country) %>%
group_by(country,durationNum)
summary4 <- data4 %>%
group_by(country) %>%
summarise(Total = n())
summary4<-summary4[order(summary4$Total, decreasing = TRUE), ]
summary4<- summary4%>%top_n(3)
# we are considering only the top 3 countries by number of TV shows
countriesTvShow <-summary4$country
#countriesTvShow
summaryTopTvShow <- data4 %>%filter(!is.na(date_added))%>%mutate( year_added =as.numeric(format(date_added,'%Y') ))%>%
group_by(country,year_added) %>% summarise(Total = n())
summaryTopTvShow<- summaryTopTvShow %>%filter(country %in% countriesTvShow)
summaryTopTvShow %>% filter(year_added < 2021) %>%
ggplot() + geom_line(aes(x = year_added, y = Total, group = country, color = country)) +
ggtitle("TV Shows distribution") +
labs(x = "Year", y = "Count")
# for movies
data_movie<-data%>% filter(type =='Movie')
data_movie<-data_movie%>%mutate(durationNum =as.numeric(strsplit(duration,"min")))
data2 <- data_movie %>%
mutate(country = strsplit(as.character(country), ", ")) %>%
unnest(country) %>%
group_by(country,durationNum)
summary2 <- data2 %>% filter(!is.na(country)) %>%
group_by(country) %>%
summarise(Total = n())
summary2<-summary2[order(summary2$Total, decreasing = TRUE), ]
summary2<- summary2%>%top_n(10)
summary6<- summary2%>%top_n(3)
countriesMovies <-summary6$country
# countriesMovies
summaryTopMoies <- data_movie %>%filter(!is.na(date_added))%>%mutate( year_added =as.numeric(format(date_added,'%Y') ))%>%
group_by(country,year_added) %>% summarise(Total = n())
summaryTopMoies<- summaryTopMoies %>%filter(country %in% countriesMovies)
summaryTopMoies %>% filter(year_added < 2021) %>%
ggplot() + geom_line(aes(x = year_added, y = Total, group = country, color = country)) +
ggtitle("Movie distribution") +
labs(x = "Year", y = "Count")
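Both chunks above split the comma-separated `country` field so that a co-production counts once for each country involved. A minimal base R sketch of that idea on toy data follows; the data frame and its values are hypothetical, and base R is used here instead of `unnest()` only to keep the example dependency-free.

```r
# Hypothetical toy data: one row per title, countries comma-separated
titles <- data.frame(
  title = c("A", "B", "C", "D"),
  country = c("United States, India", "India", "United States", NA),
  stringsAsFactors = FALSE
)

# Drop missing countries, then split each string into individual countries
known <- titles[!is.na(titles$country), ]
countries <- unlist(strsplit(known$country, ", ", fixed = TRUE))

# Count titles per country and keep the most frequent ones
counts <- sort(table(countries), decreasing = TRUE)
top_countries <- names(head(counts, 2))
print(counts)
```

Title "A" contributes to both countries, so the per-country totals can sum to more than the number of titles.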
Analyze the duration of movies
data_movie<-data%>% filter(type =='Movie')
data_movie<-data_movie%>%mutate(durationNum =as.numeric(strsplit(duration,"min")))
ggplot(data_movie) +
geom_histogram(binwidth = 10, aes(x = durationNum)) +
labs(title = 'Duration distribution of Movies')+
labs(x = "Duration", y = "Movies")
#check whether it follows a normal distribution?
ad.test(data_movie$durationNum)
##
## Anderson-Darling normality test
##
## data: data_movie$durationNum
## A = 47.076, p-value < 2.2e-16
Since the p-value is below 0.05, we reject the hypothesis of normality: the duration of movies does not follow a normal distribution.
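With several thousand movies, a formal normality test will flag even modest departures, so it can help to look at the direction of the departure as well. Below is a minimal sketch on simulated right-skewed data, not the Netflix data. It uses `shapiro.test()` from base R as a stand-in for `ad.test()` so that it runs without the nortest package; the distribution parameters are arbitrary assumptions.

```r
set.seed(1)
# Simulated right-skewed "durations" (hypothetical, not the Netflix data)
sim_duration <- rlnorm(500, meanlog = log(95), sdlog = 0.5)

# Shapiro-Wilk test from base R; it accepts between 3 and 5000 observations
sw <- shapiro.test(sim_duration)
print(sw$p.value)

# A mean above the median is a quick indicator of right skew
print(c(mean = mean(sim_duration), median = median(sim_duration)))
```

A small p-value only says the data are not normal; comparing the mean and median, or inspecting the histogram, shows how they differ.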
Analyze the number of seasons in TV Shows
data_show<-data%>% filter(type =='TV Show')
data_show1<-data_show%>%mutate(durationNum =as.numeric(strsplit(duration,"Seasons")))
data_show2<-data_show1%>% filter(is.na(durationNum))%>%mutate(durationNum =as.numeric(strsplit(duration,"Season")))
data_show1<-data_show1%>%filter(!is.na(durationNum))
data_show<-rbind(data_show1,data_show2)
ggplot(data_show) +
geom_histogram(binwidth = 1, aes(x = durationNum)) +
labs(title = 'Distribution of TV Shows Seasons')+
labs(x = "# of seasons", y = "Count") +
theme_minimal()
ad.test(data_show$durationNum)
##
## Anderson-Darling normality test
##
## data: data_show$durationNum
## A = 386.27, p-value < 2.2e-16
From the histogram it is clear that most TV shows have only one season and that the distribution is right-skewed. The p-value (< 2.2e-16) also shows that the number of seasons does not follow a normal distribution.
data_movie %>%
ggplot( aes(x = release_year,
y = durationNum)) +
geom_point() +
geom_smooth(method = "lm")+
ggtitle("Release Year VS Movie Duration") +
labs(x = "Release Year", y = "Duration")
cor.test(data_movie$release_year,data_movie$durationNum)
##
## Pearson's product-moment correlation
##
## data: data_movie$release_year and data_movie$durationNum
## t = -15.347, df = 5375, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2303589 -0.1791420
## sample estimates:
## cor
## -0.2048907
Based on the correlation coefficient (-0.20), the linear relationship between the two variables is weak.
lm1 <- lm(release_year~ durationNum, data =data_movie)
#summary(lm1)$r.squared
summary(lm1)
##
## Call:
## lm(formula = release_year ~ durationNum, data = data_movie)
##
## Residuals:
## Min 1Q Median 3Q Max
## -76.562 -1.150 2.712 4.920 19.840
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.020e+03 4.672e-01 4323.15 <2e-16 ***
## durationNum -6.940e-02 4.522e-03 -15.35 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.459 on 5375 degrees of freedom
## Multiple R-squared: 0.04198, Adjusted R-squared: 0.0418
## F-statistic: 235.5 on 1 and 5375 DF, p-value: < 2.2e-16
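From the analysis, the correlation coefficient (-0.20) is low and R-squared is about 4.2%, so the linear relationship between release year and movie duration is very weak: the slope is statistically significant, but it explains little of the variation. Note that the model above regresses release_year on durationNum, the reverse of the stated design. In a simple linear regression, R-squared is the same in either direction, so this conclusion is unaffected.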
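In a simple linear regression, R-squared equals the squared Pearson correlation, and it does not depend on which variable is treated as the response. The sketch below illustrates this on simulated data; the variable values and the trend are made up for the example and are not the Netflix data.

```r
set.seed(42)
# Simulated data (hypothetical): weak negative trend of duration over release year
release_year <- sample(1960:2020, 300, replace = TRUE)
duration <- 110 - 0.2 * (release_year - 1990) + rnorm(300, sd = 25)

# Duration as the response, release year as the predictor
fit_stated <- lm(duration ~ release_year)
# Release year as the response, duration as the predictor
fit_reversed <- lm(release_year ~ duration)

r <- cor(release_year, duration)
r2_stated <- summary(fit_stated)$r.squared
r2_reversed <- summary(fit_reversed)$r.squared

# All three values agree up to floating-point error
print(c(r^2, r2_stated, r2_reversed))

# Squaring the reported correlation gives the reported Multiple R-squared
print((-0.2048907)^2)
```

The slope and intercept do change with the direction, so the coefficient table should be read with the fitted formula in mind.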
data_show %>%
ggplot( aes(x = release_year,
y = durationNum)) +
geom_point() +
geom_smooth(method = "lm")+
ggtitle("Release Year Vs No of Seasons") +
labs(x = "Release Year", y = "#Seasons")
cor(data_show$release_year,data_show$durationNum)
## [1] -0.0912783
Based on the correlation coefficient (-0.09), the linear relationship between the two variables is very weak.
lm1 <- lm(release_year~ durationNum, data =data_show)
summary(lm1)
##
## Call:
## lm(formula = release_year ~ durationNum, data = data_show)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.443 -0.795 1.557 2.881 7.416
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2016.76694 0.17195 11728.969 < 2e-16 ***
## durationNum -0.32391 0.07201 -4.498 7.19e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.642 on 2408 degrees of freedom
## Multiple R-squared: 0.008332, Adjusted R-squared: 0.00792
## F-statistic: 20.23 on 1 and 2408 DF, p-value: 7.188e-06
From the analysis, the correlation coefficient (-0.09) is low and R-squared is about 0.8%, so there is no meaningful linear relationship between release year and number of seasons.
data2 <- data_movie %>%
mutate(country = strsplit(as.character(country), ", ")) %>%
unnest(country) %>%
group_by(country,durationNum)
summary2 <- data2 %>% filter(!is.na(country)) %>%
group_by(country) %>%
summarise(Total = n())
summary2<-summary2[order(summary2$Total, decreasing = TRUE), ]
summary2<- summary2%>%top_n(10)
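# We consider only the top 10 countries by number of movies.
# The filter below runs on the un-split data_movie, so it keeps only titles
# whose country field is exactly one of these countries.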
countries <-summary2$country
data2<- data_movie %>%
filter(country %in% countries)
data2 %>%
mutate(country = reorder(country, durationNum, FUN = mean)) %>%
ggplot(aes(country, durationNum, fill = country)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ggtitle("Country VS Movie Duration") +
labs(x = "Country", y = "Movie Duration")
data2 %>%
filter(country %in% countries) %>%
group_by(country) %>%
summarise(
size = n(),
mean = mean(durationNum, na.rm = TRUE),
sd = sd(durationNum, na.rm = TRUE)
)
## # A tibble: 10 x 4
## country size mean sd
## <chr> <int> <dbl> <dbl>
## 1 Canada 118 82.4 24.6
## 2 China 21 110. 12.5
## 3 France 69 93.2 16.3
## 4 Germany 42 94.0 23.3
## 5 India 852 127. 25.5
## 6 Japan 69 96.2 26.3
## 7 Mexico 65 86.8 23.9
## 8 Spain 89 99.9 17.3
## 9 United Kingdom 193 84.4 24.4
## 10 United States 1850 89.4 25.3
# Compute the analysis of variance
anova_country <- aov(durationNum ~ country, data = data2)
# Summary of the analysis
summary(anova_country)
## Df Sum Sq Mean Sq F value Pr(>F)
## country 9 948518 105391 170.4 <2e-16 ***
## Residuals 3358 2076567 618
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Null hypothesis: the mean duration of movies is the same across the countries.
The p-value (< 2e-16) is far below 0.05, and therefore we reject the null hypothesis.
This means that the differences in mean duration across the selected countries are statistically significant.
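Statistical significance does not say how large the country effect is. A common summary is eta squared, the share of the total sum of squares attributed to the grouping factor. The sketch below computes it from the sums of squares printed in the ANOVA table above and also recomputes the F statistic as a consistency check. Eta squared is an addition for illustration and was not part of the original analysis.

```r
# Sums of squares and degrees of freedom copied from the ANOVA table above
ss_country <- 948518
ss_residual <- 2076567
df_country <- 9
df_residual <- 3358

# Eta squared: share of total variation in duration attributable to country
eta_sq <- ss_country / (ss_country + ss_residual)
print(round(eta_sq, 3))

# F statistic: between-group mean square over within-group mean square
f_value <- (ss_country / df_country) / (ss_residual / df_residual)
print(round(f_value, 1))
```

On these figures, country accounts for roughly 31% of the variation in duration among the selected movies.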
data3 <- data_movie %>%
mutate(listed_in = strsplit(as.character(listed_in), ", ")) %>%
unnest(listed_in) %>%
group_by(listed_in,durationNum)
data3 %>%
mutate(listed_in = reorder(listed_in, durationNum, FUN = mean)) %>%
ggplot(aes(listed_in, durationNum, fill = listed_in)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ggtitle("Genre VS Movie Duration") +
labs(x = "Genre", y = "Movie Duration")
data3 %>%
group_by(listed_in) %>%
summarise(
size = n(),
mean = mean(durationNum, na.rm = TRUE),
sd = sd(durationNum, na.rm = TRUE)
)
## # A tibble: 20 x 4
## listed_in size mean sd
## <chr> <int> <dbl> <dbl>
## 1 Action & Adventure 721 113. 24.9
## 2 Anime Features 57 91.2 21.8
## 3 Children & Family Movies 532 79.7 27.6
## 4 Classic Movies 103 116. 37.4
## 5 Comedies 1471 105. 24.2
## 6 Cult Movies 59 105. 25.1
## 7 Documentaries 786 81.4 23.7
## 8 Dramas 2106 113. 25.3
## 9 Faith & Spirituality 57 105. 29.3
## 10 Horror Movies 312 97.5 15.7
## 11 Independent Movies 673 101. 18.2
## 12 International Movies 2437 111. 26.5
## 13 LGBTQ Movies 90 94.0 21.8
## 14 Movies 56 45.9 21.6
## 15 Music & Musicals 321 109. 32.3
## 16 Romantic Movies 531 111. 21.8
## 17 Sci-Fi & Fantasy 218 106. 27.9
## 18 Sports Movies 196 95.4 24.4
## 19 Stand-Up Comedy 329 67.1 13.0
## 20 Thrillers 491 107. 19.6
# Compute the analysis of variance
anova_genre <- aov(durationNum ~ listed_in, data = data3)
# Summary of the analysis
summary(anova_genre)
## Df Sum Sq Mean Sq F value Pr(>F)
## listed_in 19 1816985 95631 158.6 <2e-16 ***
## Residuals 11526 6948282 603
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Null hypothesis: the mean duration of movies is the same across the genres.
The p-value (< 2e-16) is far below 0.05, and therefore we reject the null hypothesis. This means that at least two of the genre means differ: the difference in movie duration between the genres is statistically significant.
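Rejecting the null hypothesis shows that some means differ but not which ones. One possible follow-up is Tukey's honest significant difference test, available in base R as `TukeyHSD()`. The sketch below runs it on simulated data for three hypothetical genres; the group names, means, and sample sizes are assumptions for illustration, and this step was not part of the original analysis.

```r
set.seed(7)
# Simulated durations for three hypothetical genres (not the Netflix data)
sim <- data.frame(
  genre = factor(rep(c("Comedy", "Documentary", "Drama"), each = 60)),
  duration = c(rnorm(60, 105, 24), rnorm(60, 81, 24), rnorm(60, 113, 25))
)

fit <- aov(duration ~ genre, data = sim)

# Pairwise differences in means with family-wise adjusted p-values
pairs <- TukeyHSD(fit)
print(pairs$genre)
```

Each row of the result is one pair of groups, with the estimated difference, its confidence interval, and an adjusted p-value.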
We analyzed the Netflix content data using scatter plots, line charts, boxplots and bar charts. We found that 69.1% of the content on Netflix is movies and 30.9% is TV shows.
The top 3 countries creating TV show content for Netflix are the United States, the United Kingdom and Japan, and the top 3 countries creating movie content are the United States, India and the United Kingdom.
There is no meaningful linear relationship between release year and either movie duration or the number of seasons in TV shows: the correlation coefficients and R-squared values are low.
From the analysis of variance (ANOVA), the difference in movie duration across the countries is statistically significant. The same holds for the genres.