netflix_data

## Objective

This analysis looks at TV shows that are highly rated by audiences. In particular, it asks which shows reach the largest audience, as measured by popularity and vote_count, and how the numeric measures relate to each other.

First, we load the packages and libraries needed for the analysis (install them beforehand if necessary).

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(ggplot2)
library(readr)
library(skimr)
library(tidyr)

We use the read_csv() function to import the data from the .csv file into a data frame named netflix.

Note that the echo = FALSE parameter was added to the import code chunk, so the R code that reads the file is not printed here.
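The import call itself does not appear in this report, so the exact code is not known. A minimal sketch of what it might look like is given below. The file name `netflix_sample.csv` and the overview text are hypothetical stand-ins. The other values are copied from the glimpse() output, assuming its columns line up row by row.

```r
library(readr)

# Hypothetical file name: the report does not say where the data file lives.
csv_path <- file.path(tempdir(), "netflix_sample.csv")

# Write a two-row stand-in file so the sketch runs on its own.
# The overview text is a placeholder, not the real description.
writeLines(c(
  "first_air_date,origin_country,original_language,name,popularity,vote_average,vote_count,overview",
  "2008-01-20,US,en,Breaking Bad,468.253,8.8,10131,placeholder overview",
  "2021-11-06,US,en,Arcane,95.667,8.7,2615,placeholder overview"
), csv_path)

# Declare the two column types that would otherwise be guessed differently,
# so the result matches what glimpse() reports (<chr> dates, <int> counts).
netflix <- read_csv(
  csv_path,
  col_types = cols(
    first_air_date = col_character(),
    vote_count     = col_integer(),
    .default       = col_guess()
  )
)

colnames(netflix)
```

Declaring `first_air_date` as character is only one way to reproduce the `<chr>` type seen in the report; parsing it as a date with `col_date()` would be a reasonable alternative if the dates were needed later.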

colnames(netflix)

## [1] "first_air_date"    "origin_country"    "original_language"
## [4] "name"              "popularity"        "vote_average"     
## [7] "vote_count"        "overview"

Summary and overview of data

glimpse(netflix)

## Rows: 2,617
## Columns: 8
## $ first_air_date    <chr> "2021-09-03", "2008-01-20", "2021-11-06", "2013-12-0…
## $ origin_country    <chr> "US", "US", "US", "US", "US", "JP", "CA", "JP", "JP"…
## $ original_language <chr> "en", "en", "en", "en", "en", "ja", "en", "ja", "ja"…
## $ name              <chr> "The D'Amelio Show", "Breaking Bad", "Arcane", "Rick…
## $ popularity        <dbl> 30.104, 468.253, 95.667, 1511.996, 195.038, 106.235,…
## $ vote_average      <dbl> 9.0, 8.8, 8.7, 8.7, 8.7, 8.7, 8.7, 8.7, 8.7, 8.7, 8.…
## $ vote_count        <int> 3071, 10131, 2615, 7220, 1627, 3909, 4064, 4422, 331…
## $ overview          <chr> "From relative obscurity and a seemingly normal life…

skim_without_charts(netflix)

Data summary
Name	netflix
Number of rows	2617
Number of columns	8
_______________________
Column type frequency:
character	5
numeric	3
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique
first_air_date	3	1	0	10	3	2112
origin_country	0	1	2	37	0	78
original_language	0	1	2	2	0	23
name	0	1	1	85	0	2560
overview	0	1	0	2151	65	2553

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100
popularity	1	59.81	222.41	0.87	16.57	27.49	49.76	6684.61
vote_average	1	7.69	0.62	0.60	7.30	7.70	8.10	9.00
vote_count	1	604.82	1223.23	99.00	150.00	257.00	569.00	19459.00

summary(netflix)

##  first_air_date     origin_country     original_language      name          
##  Length:2617        Length:2617        Length:2617        Length:2617       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    popularity        vote_average     vote_count        overview        
##  Min.   :   0.866   Min.   :0.600   Min.   :   99.0   Length:2617       
##  1st Qu.:  16.567   1st Qu.:7.300   1st Qu.:  150.0   Class :character  
##  Median :  27.489   Median :7.700   Median :  257.0   Mode  :character  
##  Mean   :  59.806   Mean   :7.692   Mean   :  604.8                     
##  3rd Qu.:  49.765   3rd Qu.:8.100   3rd Qu.:  569.0                     
##  Max.   :6684.611   Max.   :9.000   Max.   :19459.0

Next, we count the missing values in the dataset and find which rows contain them.

sum(is.na(netflix))

## [1] 3

which(is.na(netflix))

## [1] 1223 1999 2132

netflix_df <- drop_na(netflix)
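One caveat about the step above: is.na() on a data frame returns a logical matrix, and which() reports column-major positions in that matrix. Those positions equal row numbers here only because the missing values are in the first column (first_air_date). The sketch below uses a small made-up data frame, not the Netflix data, to show the difference and a way to get row numbers that works for any column.

```r
library(tidyr)

# Toy data frame (made up for illustration) with one NA in the SECOND column.
toy <- data.frame(
  name           = c("A", "B", "C", "D"),
  first_air_date = c("2020-01-01", NA, "2021-05-05", "2019-03-03"),
  popularity     = c(10.5, 20.1, 30.7, 40.2)
)

# Number of missing values per column.
colSums(is.na(toy))

# Column-major position in the logical matrix: 4 rows of column 1, then row 2
# of column 2, gives position 6. This is not a row number.
which(is.na(toy))

# Row numbers of incomplete rows, regardless of which column holds the NA.
which(!complete.cases(toy))

# Drop every row that contains at least one NA, as drop_na() does in the report.
toy_clean <- drop_na(toy)
nrow(toy_clean)
```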

Next, we compute the Pearson correlation between each pair of numeric variables.

cor(x = netflix_df$vote_count, y = netflix_df$popularity, method = "pearson")

## [1] 0.2857536

cor(x = netflix_df$vote_average, y = netflix_df$popularity, method = "pearson")

## [1] 0.09367311

cor(x = netflix_df$vote_average, y = netflix_df$vote_count, method = "pearson")

## [1] 0.2256251

As shown above, the correlations among the numeric variables are weak: about 0.29 between vote_count and popularity, 0.23 between vote_average and vote_count, and 0.09 between vote_average and popularity.
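The three pairwise calls can also be collapsed into one. Passing a numeric data frame to cor() returns the full correlation matrix. The sketch below uses only the first six values that glimpse() displays for each column, assuming they line up row by row, so its numbers will not match the full-dataset correlations reported above.

```r
# Small sample: the first six values shown by glimpse() for each numeric column.
sample_df <- data.frame(
  popularity   = c(30.104, 468.253, 95.667, 1511.996, 195.038, 106.235),
  vote_average = c(9.0, 8.8, 8.7, 8.7, 8.7, 8.7),
  vote_count   = c(3071, 10131, 2615, 7220, 1627, 3909)
)

# cor() on a numeric data frame returns every pairwise correlation at once.
cor_matrix <- cor(sample_df, method = "pearson")

# Round for readability; the diagonal is always 1.
round(cor_matrix, 2)
```

On the real data, the equivalent call would select the three numeric columns of netflix_df first, for example with `netflix_df[, c("popularity", "vote_average", "vote_count")]`.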

# Graphical representation

# Total popularity by original language among the 100 most popular shows
netflix_df %>%
  top_n(100, popularity) %>%
  ggplot() +
  geom_col(mapping = aes(x = original_language, y = popularity, fill = original_language))

The bar chart above shows that, by total popularity among the 100 most popular shows, the top three languages are English, Japanese, and Spanish, in that order.
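Because each bar stacks the popularity of every show in that language, the chart effectively displays a per-language sum. The sketch below makes that aggregation explicit with group_by() and summarise(). The data frame is made up for illustration; only the column names come from the report.

```r
library(dplyr)

# Toy data (made up for illustration) using the report's column names.
toy <- tibble(
  original_language = c("en", "en", "ja", "es", "ja", "en"),
  popularity        = c(100, 50, 40, 30, 20, 10)
)

# Total popularity and number of shows per language, largest total first.
language_totals <- toy %>%
  group_by(original_language) %>%
  summarise(
    total_popularity = sum(popularity),
    n_shows          = n()
  ) %>%
  arrange(desc(total_popularity))

language_totals
```

Keeping the show count next to the total helps separate a language with many moderately popular shows from one with a few very popular shows.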

# Popularity of the 10 most popular shows
netflix_df %>%
  top_n(10, popularity) %>%
  ggplot() +
  geom_col(mapping = aes(x = popularity, y = name, fill = name))

The top shows by popularity are:

1. House of the Dragon
2. The Rings of Power
3. Dahmer - Monster: The Jeffrey Dahmer Story

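# Popularity against vote_average; colour is used because the default point shape ignores fill
netflix_df %>%
  ggplot() +
  geom_point(mapping = aes(x = vote_average, y = popularity, colour = vote_average))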

vote_average shows no clear relationship with popularity; most vote_average values cluster between roughly 7 and 8.

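# Popularity against vote_count; colour is used because the default point shape ignores fill
netflix_df %>%
  ggplot() +
  geom_point(mapping = aes(x = vote_count, y = popularity, colour = vote_count))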

The same holds for vote_count: there is no clear relationship with popularity, and most shows have fewer than 5,000 votes.
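The summary statistics show that popularity is heavily skewed (median 27.49, maximum 6684.61), so most points sit close to the axis on a linear scale. As a possible refinement, not part of the original analysis, the sketch below plots both variables on log scales and computes Spearman's rank correlation, which is less sensitive to extreme values than Pearson's. It uses only the first six values shown by glimpse(), assuming they line up row by row.

```r
library(ggplot2)

# Small sample: the first six values shown by glimpse() for these two columns.
sample_df <- data.frame(
  popularity = c(30.104, 468.253, 95.667, 1511.996, 195.038, 106.235),
  vote_count = c(3071, 10131, 2615, 7220, 1627, 3909)
)

# Log scales on both axes spread out the skewed values.
p <- ggplot(sample_df, aes(x = vote_count, y = popularity)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10()
p

# Rank-based correlation as a complement to the Pearson values in the report.
rho <- cor(sample_df$vote_count, sample_df$popularity, method = "spearman")
rho
```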

# Total popularity by country of origin among the 100 most popular shows
netflix_df %>%
  top_n(100, popularity) %>%
  ggplot() +
  geom_col(mapping = aes(x = origin_country, y = popularity, fill = origin_country))

Conclusions:

1. Most of the popular shows have English as their original language.
2. Korean, Spanish, Japanese, and Turkish are the up-and-coming languages among popular TV shows.
3. The United States, Japan, and France produce the most popular shows.

Note: I think this is only half of the story. In my opinion, we need more data to draw more precise insights.

chitransh

2022-12-01