##Objective : This analysis of the Netflix dataset looks at the viewership of TV shows that are highly rated by the audience. In particular, we want to know which TV shows reach the largest audience, as measured by popularity and vote_count, and how the numeric variables relate to each other.
First, we load the packages and libraries needed for the analysis.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggplot2)
library(readr)
library(skimr)
library(tidyr)
We use the read_csv() function to import the data from the .csv file into a data frame named netflix.
Note that the echo = FALSE parameter was added to the
code chunk, so the R code that reads the file is not
printed here.
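Because the import code is not printed, here is a minimal sketch of what such a chunk might look like. The file path is an assumption (the real path is not shown), so the example writes two rows copied from the dataset to a temporary file and reads them back; the overview column is left out for brevity.

```r
library(readr)

# Hypothetical stand-in for the real file: two rows copied from the dataset,
# written to a temporary .csv so that the example runs on its own.
sample_path <- tempfile(fileext = ".csv")
writeLines(c(
  "first_air_date,origin_country,original_language,name,popularity,vote_average,vote_count",
  "2021-09-03,US,en,The D'Amelio Show,30.104,9.0,3071",
  "2008-01-20,US,en,Breaking Bad,468.253,8.8,10131"
), sample_path)

# In the real analysis, sample_path would be the path to the Netflix .csv file.
netflix_sample <- read_csv(sample_path, show_col_types = FALSE)

colnames(netflix_sample)
```

read_csv() guesses column types, so the types it reports can differ from those in the glimpse() output depending on how the file was actually read.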
colnames(netflix)
## [1] "first_air_date" "origin_country" "original_language"
## [4] "name" "popularity" "vote_average"
## [7] "vote_count" "overview"
Summary and overview of data
glimpse(netflix)
## Rows: 2,617
## Columns: 8
## $ first_air_date <chr> "2021-09-03", "2008-01-20", "2021-11-06", "2013-12-0…
## $ origin_country <chr> "US", "US", "US", "US", "US", "JP", "CA", "JP", "JP"…
## $ original_language <chr> "en", "en", "en", "en", "en", "ja", "en", "ja", "ja"…
## $ name <chr> "The D'Amelio Show", "Breaking Bad", "Arcane", "Rick…
## $ popularity <dbl> 30.104, 468.253, 95.667, 1511.996, 195.038, 106.235,…
## $ vote_average <dbl> 9.0, 8.8, 8.7, 8.7, 8.7, 8.7, 8.7, 8.7, 8.7, 8.7, 8.…
## $ vote_count <int> 3071, 10131, 2615, 7220, 1627, 3909, 4064, 4422, 331…
## $ overview <chr> "From relative obscurity and a seemingly normal life…
skim_without_charts(netflix)
| Name | netflix |
| Number of rows | 2617 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| character | 5 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| first_air_date | 3 | 1 | 0 | 10 | 3 | 2112 | 0 |
| origin_country | 0 | 1 | 2 | 37 | 0 | 78 | 0 |
| original_language | 0 | 1 | 2 | 2 | 0 | 23 | 0 |
| name | 0 | 1 | 1 | 85 | 0 | 2560 | 0 |
| overview | 0 | 1 | 0 | 2151 | 65 | 2553 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| popularity | 0 | 1 | 59.81 | 222.41 | 0.87 | 16.57 | 27.49 | 49.76 | 6684.61 |
| vote_average | 0 | 1 | 7.69 | 0.62 | 0.60 | 7.30 | 7.70 | 8.10 | 9.00 |
| vote_count | 0 | 1 | 604.82 | 1223.23 | 99.00 | 150.00 | 257.00 | 569.00 | 19459.00 |
summary(netflix)
## first_air_date origin_country original_language name
## Length:2617 Length:2617 Length:2617 Length:2617
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## popularity vote_average vote_count overview
## Min. : 0.866 Min. :0.600 Min. : 99.0 Length:2617
## 1st Qu.: 16.567 1st Qu.:7.300 1st Qu.: 150.0 Class :character
## Median : 27.489 Median :7.700 Median : 257.0 Mode :character
## Mean : 59.806 Mean :7.692 Mean : 604.8
## 3rd Qu.: 49.765 3rd Qu.:8.100 3rd Qu.: 569.0
## Max. :6684.611 Max. :9.000 Max. :19459.0
Next, we count the missing (NA) values in the dataset and find the rows in which they occur.
sum(is.na(netflix))
## [1] 3
which(is.na(netflix))
## [1] 1223 1999 2132
netflix_df<- drop_na(netflix)
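which(is.na(netflix)) returns positions counted down the columns of the whole table. They match row numbers here only because all three missing values fall in the first column, first_air_date. A more explicit check could look like the sketch below; the toy data frame and its NA positions are hypothetical and only stand in for netflix.

```r
# Hypothetical toy data standing in for `netflix`; the NA positions are made up.
toy <- data.frame(
  first_air_date = c("2021-09-03", NA, "2021-11-06"),
  popularity     = c(30.104, 468.253, NA)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Row and column position of every missing value
na_positions <- which(is.na(toy), arr.ind = TRUE)

# Rows without any missing value, i.e. the rows that drop_na() would keep
complete_rows <- toy[complete.cases(toy), ]
```

Checking colSums(is.na(netflix)) before calling drop_na() makes it clear which variable is responsible for the dropped rows.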
Next, we compute the Pearson correlation between each pair of numeric variables.
cor(x= netflix_df$vote_count,y = netflix_df$popularity,method="pearson")
## [1] 0.2857536
cor(x= netflix_df$vote_average,y = netflix_df$popularity,method = "pearson")
## [1] 0.09367311
cor(x = netflix_df$vote_average,y = netflix_df$vote_count,method = "pearson")
## [1] 0.2256251
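The coefficients above show only weak positive correlations between the numeric variables (popularity, vote_count, vote_average): the strongest is about 0.29 (vote_count and popularity) and the weakest about 0.09 (vote_average and popularity).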
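The three pairwise calls can also be written as a single correlation matrix. The sketch below uses only the first six rows shown in the glimpse() output, so its coefficients will not match the full-data values; in the real analysis the same call would be applied to the numeric columns of netflix_df.

```r
# First six rows of the numeric columns, copied from the glimpse() output
# (a tiny sample for illustration only).
sample_df <- data.frame(
  popularity   = c(30.104, 468.253, 95.667, 1511.996, 195.038, 106.235),
  vote_average = c(9.0, 8.8, 8.7, 8.7, 8.7, 8.7),
  vote_count   = c(3071, 10131, 2615, 7220, 1627, 3909)
)

# Pearson correlation for every pair of columns at once
cor_matrix <- cor(sample_df, method = "pearson")
round(cor_matrix, 2)
```

On the full data this would be cor(netflix_df[, c("popularity", "vote_average", "vote_count")]), which returns all three coefficients in one symmetric table.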
#Graphical representation
netflix_df %>%
top_n(100,popularity) %>%
ggplot()+geom_bar(mapping=aes(x=original_language,y= popularity,fill = original_language),stat = "identity")
The bar graph above shows that, by total popularity among the 100 most
popular shows, the top 3 languages are: 1. English 2. Japanese 3. Spanish
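With stat = "identity", the bars for one language are stacked, so each bar's height is the sum of popularity over that language's shows. A language can therefore rank high either through many shows or through a few very popular ones. The sketch below separates the two; it uses six rows copied from the glimpse() output in place of netflix_df, and the column names n_shows and total_popularity are chosen here for illustration.

```r
library(dplyr)

# Six rows copied from the glimpse() output, standing in for netflix_df
sample_df <- tibble(
  original_language = c("en", "en", "en", "en", "en", "ja"),
  popularity        = c(30.104, 468.253, 95.667, 1511.996, 195.038, 106.235)
)

language_summary <- sample_df %>%
  slice_max(popularity, n = 5) %>%       # the real analysis keeps the top 100
  group_by(original_language) %>%
  summarise(
    n_shows          = n(),              # number of top shows in this language
    total_popularity = sum(popularity)   # height of the stacked bar
  ) %>%
  arrange(desc(total_popularity))

language_summary
```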
netflix_df %>%
top_n(10,popularity) %>%
ggplot()+geom_bar(mapping=aes(x= popularity,y=name,fill=name),stat = "identity")
Top shows according to popularity: 1. House of the Dragon 2. The Rings
of Power 3. Dahmer - Monster: The Jeffrey Dahmer Story
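By default the bars appear in alphabetical order of name, which makes the ranking harder to read. One possible variant, sketched below, sorts the bars by popularity with reorder(). It also uses slice_max() and geom_col(), the current equivalents of top_n() and geom_bar(stat = "identity"). The three rows are copied from the glimpse() output and stand in for netflix_df.

```r
library(dplyr)
library(ggplot2)

# Three rows copied from the glimpse() output, standing in for netflix_df
sample_df <- tibble(
  name       = c("The D'Amelio Show", "Breaking Bad", "Arcane"),
  popularity = c(30.104, 468.253, 95.667)
)

top_shows <- sample_df %>%
  slice_max(popularity, n = 2)   # the real analysis keeps the top 10

# reorder() sorts the show names by popularity so the bars appear ranked
p <- ggplot(top_shows) +
  geom_col(aes(x = popularity, y = reorder(name, popularity), fill = name)) +
  labs(y = "name")

p
```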
netflix_df %>%
ggplot()+geom_point(mapping=aes(x = vote_average, y = popularity,colour = vote_average))
vote_average shows no clear relationship with popularity; most
vote_average values lie between about 7 and 8 points.
netflix_df %>%
ggplot()+geom_point(mapping=aes(x= vote_count,y=popularity,colour = vote_count))
The same holds for vote_count: it has only a weak relationship with
popularity, and most values lie below 5,000 votes.
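popularity is strongly right-skewed (median about 27, maximum about 6,685), so a few extreme shows can dominate a Pearson coefficient. A rank-based Spearman coefficient is one possible complement. This is a sketch on the first six rows from the glimpse() output, not a result for the full data.

```r
# First six rows copied from the glimpse() output (illustration only)
sample_df <- data.frame(
  popularity = c(30.104, 468.253, 95.667, 1511.996, 195.038, 106.235),
  vote_count = c(3071, 10131, 2615, 7220, 1627, 3909)
)

# Pearson measures linear association and is sensitive to extreme values
pearson <- cor(sample_df$vote_count, sample_df$popularity, method = "pearson")

# Spearman works on ranks, so a single very popular show has less influence
spearman <- cor(sample_df$vote_count, sample_df$popularity, method = "spearman")

c(pearson = pearson, spearman = spearman)
```

On the full data the same two calls would use netflix_df$vote_count and netflix_df$popularity.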
netflix_df %>%
top_n(100,popularity) %>%
ggplot()+geom_bar(mapping=aes(x= origin_country,y= popularity,fill = origin_country),stat = "identity")
Conclusions:
1. The most popular shows have English as their original language.
2. Korean, Spanish, Japanese and Turkish are the up-and-coming popular languages in TV shows.
3. The United States, Japan and France are the countries that produce the most popular shows.
#Note: I think this is only half of the story. In my opinion, we need more data to draw more precise insights.