library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)

## Introduction

The rapid growth of Over-The-Top (OTT) platforms has significantly transformed the way content is consumed, especially among Generation Z. Platforms like Amazon Prime provide a vast library of movies and TV shows, making them a crucial part of digital entertainment. This project focuses on analyzing the Amazon Prime Movies and TV Shows dataset using R programming. The objective is to perform data cleaning, transformation, and extract meaningful insights from the dataset.

data <- read.csv("C:/Users/cenzo/OneDrive/Desktop/amazon_prime_titles.csv")

# View first 6 rows
head(data)
##   show_id  type                 title       director
## 1      s1 Movie   The Grand Seduction   Don McKellar
## 2      s2 Movie  Take Care Good Night   Girish Joshi
## 3      s3 Movie  Secrets of Deception    Josh Webber
## 4      s4 Movie    Pink: Staying True Sonia Anderson
## 5      s5 Movie         Monster Maker   Giles Foster
## 6      s6 Movie Living With Dinosaurs   Paul Weiland
##                                                                                                                                    cast
## 1                                                                                        Brendan Gleeson, Taylor Kitsch, Gordon Pinsent
## 2                                                                                      Mahesh Manjrekar, Abhay Mahajan, Sachin Khedekar
## 3                                               Tom Sizemore, Lorenzo Lamas, Robert LaSardo, Richard Jones, Yancey Arias, Noel Gugliemi
## 4                                                      Interviews with: Pink, Adele, Beyoncé, Britney Spears, Christina Aguilera, more!
## 5 Harry Dean Stanton, Kieran O'Brien, George Costigan, Amanda Dickinson, Alison Steadman, Grant Bardsley, Bill Moody, Matthew Scurfield
## 6                                                                     Gregory Chisholm, Juliet Stevenson, Brian Henson, Michael Maloney
##          country     date_added release_year rating duration
## 1         Canada March 30, 2021         2014         113 min
## 2          India March 30, 2021         2018    13+  110 min
## 3  United States March 30, 2021         2017          74 min
## 4  United States March 30, 2021         2014          69 min
## 5 United Kingdom March 30, 2021         1989          45 min
## 6 United Kingdom March 30, 2021         1989          52 min
##                 listed_in
## 1           Comedy, Drama
## 2    Drama, International
## 3 Action, Drama, Suspense
## 4             Documentary
## 5          Drama, Fantasy
## 6           Fantasy, Kids
##                                                                                                                                                                                                                                                                                                                                                                                             description
## 1                   A small fishing village must procure a local doctor to secure a lucrative business contract. When unlikely candidate and big city doctor Paul Lewis lands in their lap for a trial residence, the townsfolk rally together to charm him into staying. As the doctor's time in the village winds to a close, acting mayor Murray French has no choice but to pull out all the stops.
## 2                                                                                                                                                                                                                                                                                                               A Metro Family decides to fight a Cyber Criminal threatening their stability and pride.
## 3                                                                                                                                                                                                                                                                             After a man discovers his wife is cheating on him with a neighborhood kid he goes down a furious path of self-destruction
## 4                                                                                                                                                     Pink breaks the mold once again, bringing her career to a new level in 2013 with a world tour that entertains unlike ever before! Get inside access to "the girl who got the party started" with exclusive interviews and rare live performances.
## 5                                                            Teenage Matt Banting wants to work with a famous but eccentric creature/special effects man named Chancey Bellows. He gets more than he bargained for when one of the creatures, the giant dragon-like Ultragorgon, takes Matt under his wing. Matt is forced to confront his inner monsters while working out his issues with his father.
## 6 The story unfolds in a an English seaside town, where Dom, an only child, faces the imminent arrival of a new sibling, and subsequently diminished attention from his mother. A stuffed toy dinosaur named Dog is Dom's only confidant – Dom relies on his friend heavily for support as he confronts his problems, accepts the changes in his life, and understands the love he has for his parents.
# Structure of dataset
str(data)
## 'data.frame':    9668 obs. of  12 variables:
##  $ show_id     : chr  "s1" "s2" "s3" "s4" ...
##  $ type        : chr  "Movie" "Movie" "Movie" "Movie" ...
##  $ title       : chr  "The Grand Seduction" "Take Care Good Night" "Secrets of Deception" "Pink: Staying True" ...
##  $ director    : chr  "Don McKellar" "Girish Joshi" "Josh Webber" "Sonia Anderson" ...
##  $ cast        : chr  "Brendan Gleeson, Taylor Kitsch, Gordon Pinsent" "Mahesh Manjrekar, Abhay Mahajan, Sachin Khedekar" "Tom Sizemore, Lorenzo Lamas, Robert LaSardo, Richard Jones, Yancey Arias, Noel Gugliemi" "Interviews with: Pink, Adele, Beyoncé, Britney Spears, Christina Aguilera, more!" ...
##  $ country     : chr  "Canada" "India" "United States" "United States" ...
##  $ date_added  : chr  "March 30, 2021" "March 30, 2021" "March 30, 2021" "March 30, 2021" ...
##  $ release_year: int  2014 2018 2017 2014 1989 1989 2017 2016 2017 1994 ...
##  $ rating      : chr  "" "13+" "" "" ...
##  $ duration    : chr  "113 min" "110 min" "74 min" "69 min" ...
##  $ listed_in   : chr  "Comedy, Drama" "Drama, International" "Action, Drama, Suspense" "Documentary" ...
##  $ description : chr  "A small fishing village must procure a local doctor to secure a lucrative business contract. When unlikely cand"| __truncated__ "A Metro Family decides to fight a Cyber Criminal threatening their stability and pride." "After a man discovers his wife is cheating on him with a neighborhood kid he goes down a furious path of self-destruction" "Pink breaks the mold once again, bringing her career to a new level in 2013 with a world tour that entertains u"| __truncated__ ...

## Data Understanding

dim(data)
## [1] 9668   12
colnames(data)
##  [1] "show_id"      "type"         "title"        "director"     "cast"        
##  [6] "country"      "date_added"   "release_year" "rating"       "duration"    
## [11] "listed_in"    "description"
summary(data)
##    show_id              type              title             director        
##  Length:9668        Length:9668        Length:9668        Length:9668       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      cast             country           date_added         release_year 
##  Length:9668        Length:9668        Length:9668        Min.   :1920  
##  Class :character   Class :character   Class :character   1st Qu.:2007  
##  Mode  :character   Mode  :character   Mode  :character   Median :2016  
##                                                           Mean   :2008  
##                                                           3rd Qu.:2019  
##                                                           Max.   :2021  
##     rating            duration          listed_in         description       
##  Length:9668        Length:9668        Length:9668        Length:9668       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
## 

## Missing Values

colSums(is.na(data))
##      show_id         type        title     director         cast      country 
##            0            0            0            0            0            0 
##   date_added release_year       rating     duration    listed_in  description 
##            0            0            0            0            0            0

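Interpretation: colSums(is.na(data)) reports zero missing values in every column, but this does not mean the data is complete. read.csv() imports blank text fields as empty strings ("") rather than NA, and the str() output above already shows empty strings in the rating column. Missing values in this dataset therefore have to be detected as blank strings as well as NA.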
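The point about blank strings can be checked with a short sketch. It uses a small made-up data frame (`toy`, its values, and the helper `is_blank` are illustrative assumptions, not part of the dataset or the original analysis) and counts `NA`, empty, and whitespace-only entries as missing.

```r
# Toy data frame standing in for the real dataset (values are illustrative)
toy <- data.frame(
  title   = c("A", "B", "C", "D"),
  country = c("India", "", "  ", "Canada"),
  rating  = c("", "13+", "16+", ""),
  stringsAsFactors = FALSE
)

# is.na() alone reports no missing values, because blanks are not NA
na_only <- colSums(is.na(toy))

# Hypothetical helper: treat NA, "" and whitespace-only strings as missing
is_blank <- function(x) is.na(x) | trimws(x) == ""
missing_counts <- sapply(toy, function(col) sum(is_blank(col)))

print(na_only)
print(missing_counts)
```

An alternative, if re-importing is acceptable, is to pass `na.strings = c("", "NA")` to `read.csv()` so that blank fields arrive as `NA` in the first place.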

## Data Cleaning

# Check whether the dataset contains duplicate rows

sum(duplicated(data))
## [1] 0
# Replace NA with "Unknown"
data$country[is.na(data$country)] <- "Unknown"

# Remove leading/trailing spaces
data$country <- trimws(data$country)

# Replace empty values with "Unknown" (trimws() above has already reduced blank-only values to "")
data$country[data$country == ""] <- "Unknown"

#  Split multiple countries
data_clean <- data %>%
  separate_rows(country, sep = ",")

#  Trim again after split
data_clean$country <- trimws(data_clean$country)

#  Remove any still-empty values
data_clean <- data_clean[data_clean$country != "", ]

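Interpretation: sum(duplicated(data)) returns 0, so the dataset contains no fully duplicated rows and no duplicate removal is needed. The country column does need cleaning: blank values are replaced with "Unknown", and data_clean stores a copy in which titles that list several countries are split into one row per country.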
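To illustrate why the split matters when counting, here is a minimal sketch on made-up rows (`toy` and its values are assumptions for illustration). It uses `separate_rows()` as above, with a separator pattern that also absorbs the space after each comma.

```r
library(dplyr)
library(tidyr)

# Toy data (illustrative values only)
toy <- data.frame(
  title   = c("A", "B", "C"),
  country = c("United Kingdom, United States", "India", "United States"),
  stringsAsFactors = FALSE
)

# One row per title-country pair; the pattern also removes spaces after commas
toy_long <- toy %>%
  separate_rows(country, sep = ",\\s*")

# Each country is now counted once per title it appears in
country_counts <- toy_long %>%
  count(country, sort = TRUE)

print(country_counts)
```

Without the split, "United Kingdom, United States" would be counted as a separate country value of its own.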

## DATA TRANSFORMATION

# Convert type to factor
data$type <- as.factor(data$type)

# Convert rating to factor
data$rating <- as.factor(data$rating)

# Extract year from date_added (last four characters of the string)
data$year_added <- substr(data$date_added, nchar(data$date_added) - 3, nchar(data$date_added))

# Create content age category
data$content_age <- ifelse(data$release_year > 2015, "New", "Old")

# Convert country to factor
data$country <- as.factor(data$country)

Interpretation: These transformations prepare the data for analysis. The categorical variables type, rating, and country were converted to factors, which simplifies grouping and plotting. A new variable ‘year_added’ holds the last four characters of date_added, so that additions can be analysed over time; it is stored as text and is blank wherever date_added is blank. A second variable ‘content_age’ labels titles released after 2015 as “New” and all others as “Old”, enabling comparison between recent and older content.
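A slightly more defensive variant of the year extraction is sketched below. It is not the approach used in the original code: it assumes the year is always the final four digits of `date_added`, returns `NA` instead of an empty string when the date is blank, and stores the result as an integer. The input vectors are made-up examples.

```r
# Illustrative inputs, including a blank date
date_added   <- c("March 30, 2021", "", "June 1, 2020")
release_year <- c(2014L, 2021L, 1989L)

# Keep the trailing four digits where present, otherwise NA
year_chr <- ifelse(grepl("[0-9]{4}$", date_added),
                   sub(".*([0-9]{4})$", "\\1", date_added),
                   NA_character_)
year_added <- as.integer(year_chr)

# Same rule as in the report: released after 2015 counts as "New"
content_age <- ifelse(release_year > 2015, "New", "Old")

print(year_added)
print(content_age)
```

Parsing the full date with `as.Date(date_added, format = "%B %d, %Y")` is another option, but `%B` depends on the session locale, so the regular-expression route may be more portable.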

## SORTING AND FILTERING

# Latest Content

data %>%
  arrange(desc(release_year)) %>%
  head()
##   show_id    type                                        title
## 1     s97   Movie                                     Wildlike
## 2    s114   Movie                                  White Tiger
## 3    s135 TV Show WGC-Dell Technologies Match Play Reveal Show
## 4    s152   Movie                                 War of Likes
## 5    s203   Movie                      V1 Murder Case (Telugu)
## 6    s221   Movie                                  Underplayed
##             director
## 1   Frank Hall Green
## 2 Karen Shakhnazarov
## 3                   
## 4       María Ripoll
## 5  Pavel Navageethan
## 6         Stacey Lee
##                                                                                                                                                                                                                                                                                           cast
## 1                                                                                                                                                                                                                                                Bruce Greenwood, Ella Purnell, Brian Geraghty
## 2                                                                                                                                                                                                                                         Aleksey Vertkov, Vitaliy Kishchenko, Valeriy Grishko
## 3                                                                                                                                                                                                                                                                                             
## 4 Ludwika Paleta, Regina Blandón, Manolo Cardona, Michelle Rodríguez, Loreto Peralta, Mauricio “Diablito” Barrientos, Catalina López, Patricia Bernal, Héctor Jiménez, Siouzanna Melikián, Paulette Hernández, Pamela Cortés, José Sefami, Pablo Cruz, Elsa Ortíz, Enrique Olvera, Miguel Bosé
## 5                                                                                                                                                                                                                       Ram Arun Castro, Vishnupriya Pillai, Gayathri Raja, Lijeesh, Mime Gopi
## 6                                                                                                                                                                                                                     TokiMonsta, Alison Wonderland, Nervo, Rezz, Nightwave, Sherelle, Tygapaw
##   country date_added release_year rating duration               listed_in
## 1 Unknown                    2021    16+  104 min           Action, Drama
## 2 Unknown                    2021    16+  109 min Action, Science Fiction
## 3 Unknown                    2021  TV-NR 1 Season                  Sports
## 4 Unknown                    2021    16+  104 min                  Comedy
## 5 Unknown                    2021    13+  111 min        Action, Suspense
## 6 Unknown                    2021    13+   90 min             Documentary
##                                                                                                                                                                                                                                                                                                                                                                                                 description
## 1                                                                                                                                                                                                                                                                  An unlikely friendship forms in the spectacular Alaskan wilderness, giving a runaway girl hope and sanctuary in America's last frontier.
## 2                                                                                                                                                                                                                     Great Patriotic War, early 1940s. After barely surviving a battle with a mysterious, ghostly-white Tiger tank, Red Army Sergeant Ivan Naydenov becomes obsessed with its destruction.
## 3                                                                                                                                                   Host Jonathan Coachman and PGATOUR.COM Senior Writer Ben Everill break down the brackets for the 2021 World Golf Championships Dell Technologies Match Play at Austin Country Club with some special guests, including defending champion Kevin Kisner.
## 4                                                                                                                                                              In order to advance her career in the dynamic world of publicity in Mexico City, Raquel tries to reunite with her high school friend Cecy who has become the queen of social media. But unlike followers, friendships do not come instantly.
## 5 The film begins with the murder of a girl who was in a live-in relationship. Agni, a forensic expert with nyctophobia, decides to take on this murder investigation. At its core, V1 Murder Case is perhaps a film about confronting the most fearful, darkest parts of our minds. When three people connected to the victim are brought in for questioning, each narrates the same incident differently.
## 6                                    Filmed over the summer festival season, Underplayed presents a portrait of the current status of the gender, ethnic, and sexuality equality issues in dance music. Seen through the lens of the female pioneers, next-generation artists and industry leaders who are championing the change, and inspiring a more diverse pool of role models for future generations.
##   year_added content_age
## 1                    New
## 2                    New
## 3                    New
## 4                    New
## 5                    New
## 6                    New
# Oldest Content
data %>%
  arrange(release_year) %>%
  head()
##   show_id  type               title                         director
## 1     s84 Movie    Within Our Gates                   Oscar Micheaux
## 2   s1285 Movie           Pollyanna                      Paul Powell
## 3   s1475 Movie Nomads Of The North                   David Hartford
## 4   s1144 Movie Robin Hood (Silent)                       Allan Dwan
## 5   s1426 Movie  One Exciting Night                    D.W. Griffith
## 6   s1685 Movie      Merry-Go-Round Rupert Julian, Eric von Stroheim
##                                                                  cast country
## 1                                          Evelyn Preer, Flo Clements Unknown
## 2 Mary Pickford, Wharton James, Katherine Griffith, Helen Jerome Eddy Unknown
## 3                                            Betty Blythe, Lon Chaney Unknown
## 4   Douglas Fairbanks Sr., Enid Bennett, Wallace Beery, Alan Hale Sr. Unknown
## 5                                          Carol Dempster, Henry Hull Unknown
## 6              Norman Kerry, Mary Philbin, Albert Conti, Al Edmundsen Unknown
##   date_added release_year rating duration           listed_in
## 1                    1920    13+   78 min               Drama
## 2                    1920     NR   60 min Comedy, Drama, Kids
## 3                    1920    13+   78 min               Drama
## 4                    1922    13+  121 min              Action
## 5                    1922    13+  145 min            Suspense
## 6                    1923    13+  113 min               Drama
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               description
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Abandoned by her fiance, an educated negro woman with a shocking past dedicates herself to helping a near bankrupt school for impoverished negro youths.
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         When Pollyanna is orphaned, she is sent to live with her crotchety Aunt Polly. Pollyanna discovers that many of the people in her aunt's New England home town are as ill-tempered as her aunt.
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       A Canadian Mountie allows an innocent fugitive to escape with the women he loves.
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         Amid big-budget medieval pageantry, King Richard goes on the Crusades leaving his brother Prince John as regent, who promptly emerges as a cruel, grasping, treacherous tyrant.
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                A young orphan girl, courted by an unpleasant older wealthy man who has a hold over her adoptive mother, falls in love with a young stranger at a party.
## 6 In Vienna circa World War I, decadent Count Franz Maximillian von Hohenegg (Norman Kerry), posing as a necktie salesman, falls in love with innocent Agnes (Mary Philbin), an organ grinder in the city's pleasure zone. An edgy assortment of characters including Franz's cigar-puffing mother, a sadistic brute and rapist, a sensitive hunchback and his trained gorilla, and royalty and circus freaks populate this lavishly-produced and fascinating cinematic classic. Though his name appears nowhere on the final product, legendary director Eric von Stroheim (Greed) penned the story and script, oversaw the sets and costumes, selected the cast, and directed at least one-quarter of the film before being fired by production head Irving Thalberg and replaced by director Rupert Julian (The Phantom of the Opera).
##   year_added content_age
## 1                    Old
## 2                    Old
## 3                    Old
## 4                    Old
## 5                    Old
## 6                    Old
# Top Recent Content
data %>%
  filter(release_year >= 2020) %>%
  arrange(desc(release_year)) %>%
  head()
##   show_id    type                                        title
## 1     s97   Movie                                     Wildlike
## 2    s114   Movie                                  White Tiger
## 3    s135 TV Show WGC-Dell Technologies Match Play Reveal Show
## 4    s152   Movie                                 War of Likes
## 5    s203   Movie                      V1 Murder Case (Telugu)
## 6    s221   Movie                                  Underplayed
##             director
## 1   Frank Hall Green
## 2 Karen Shakhnazarov
## 3                   
## 4       María Ripoll
## 5  Pavel Navageethan
## 6         Stacey Lee
##                                                                                                                                                                                                                                                                                           cast
## 1                                                                                                                                                                                                                                                Bruce Greenwood, Ella Purnell, Brian Geraghty
## 2                                                                                                                                                                                                                                         Aleksey Vertkov, Vitaliy Kishchenko, Valeriy Grishko
## 3                                                                                                                                                                                                                                                                                             
## 4 Ludwika Paleta, Regina Blandón, Manolo Cardona, Michelle Rodríguez, Loreto Peralta, Mauricio “Diablito” Barrientos, Catalina López, Patricia Bernal, Héctor Jiménez, Siouzanna Melikián, Paulette Hernández, Pamela Cortés, José Sefami, Pablo Cruz, Elsa Ortíz, Enrique Olvera, Miguel Bosé
## 5                                                                                                                                                                                                                       Ram Arun Castro, Vishnupriya Pillai, Gayathri Raja, Lijeesh, Mime Gopi
## 6                                                                                                                                                                                                                     TokiMonsta, Alison Wonderland, Nervo, Rezz, Nightwave, Sherelle, Tygapaw
##   country date_added release_year rating duration               listed_in
## 1 Unknown                    2021    16+  104 min           Action, Drama
## 2 Unknown                    2021    16+  109 min Action, Science Fiction
## 3 Unknown                    2021  TV-NR 1 Season                  Sports
## 4 Unknown                    2021    16+  104 min                  Comedy
## 5 Unknown                    2021    13+  111 min        Action, Suspense
## 6 Unknown                    2021    13+   90 min             Documentary
##                                                                                                                                                                                                                                                                                                                                                                                                 description
## 1                                                                                                                                                                                                                                                                  An unlikely friendship forms in the spectacular Alaskan wilderness, giving a runaway girl hope and sanctuary in America's last frontier.
## 2                                                                                                                                                                                                                     Great Patriotic War, early 1940s. After barely surviving a battle with a mysterious, ghostly-white Tiger tank, Red Army Sergeant Ivan Naydenov becomes obsessed with its destruction.
## 3                                                                                                                                                   Host Jonathan Coachman and PGATOUR.COM Senior Writer Ben Everill break down the brackets for the 2021 World Golf Championships Dell Technologies Match Play at Austin Country Club with some special guests, including defending champion Kevin Kisner.
## 4                                                                                                                                                              In order to advance her career in the dynamic world of publicity in Mexico City, Raquel tries to reunite with her high school friend Cecy who has become the queen of social media. But unlike followers, friendships do not come instantly.
## 5 The film begins with the murder of a girl who was in a live-in relationship. Agni, a forensic expert with nyctophobia, decides to take on this murder investigation. At its core, V1 Murder Case is perhaps a film about confronting the most fearful, darkest parts of our minds. When three people connected to the victim are brought in for questioning, each narrates the same incident differently.
## 6                                    Filmed over the summer festival season, Underplayed presents a portrait of the current status of the gender, ethnic, and sexuality equality issues in dance music. Seen through the lens of the female pioneers, next-generation artists and industry leaders who are championing the change, and inspiring a more diverse pool of role models for future generations.
##   year_added content_age
## 1                    New
## 2                    New
## 3                    New
## 4                    New
## 5                    New
## 6                    New
# Most Content-Producing Countries
data %>%
  count(country, sort = TRUE) %>%
  head()
##                         country    n
## 1                       Unknown 8996
## 2                 United States  253
## 3                         India  229
## 4                United Kingdom   28
## 5                        Canada   16
## 6 United Kingdom, United States   12
# Most Common Genre Combinations

data %>%
  count(listed_in, sort = TRUE) %>%
  head()
##         listed_in   n
## 1           Drama 986
## 2          Comedy 536
## 3 Drama, Suspense 399
## 4   Comedy, Drama 377
## 5 Animation, Kids 356
## 6     Documentary 350

Interpretation

The sorting and filtering analysis shows that the dataset is mainly focused on recent content (2020–2021), indicating a strong emphasis on trending material, while also including older titles that add diversity. Among titles with a recorded country, the United States (253) and India (229) dominate, although 8,996 of the 9,668 titles have country "Unknown", so this ranking covers only a small part of the catalogue. The most frequent genre labels are Drama, Comedy, and the combinations "Drama, Suspense" and "Comedy, Drama". Overall, the platform appears to balance modern trends, classic content, and high-demand categories to maximize engagement.

##DATA VISUALIZATION

#Content Distribution (Movies vs TV Shows)
ggplot(data, aes(x = type, fill = type)) +
  geom_bar() +
  # after_stat(count) replaces the ..count.. notation deprecated in ggplot2 3.4.0
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.3) +
  labs(title = "Distribution of Movies vs TV Shows",
       x = "Type", y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")

#Content Release Trend (Improved)

ggplot(data, aes(x = release_year)) +
  geom_histogram(binwidth = 1, fill = "steelblue", color = "black") +
  labs(title = "Content Release Trend",
       x = "Year", y = "Count") +
  theme_minimal()

#Top Countries 
top_countries <- data %>%
  count(country, sort = TRUE) %>%
  head(10)



ggplot(top_countries, aes(x=reorder(country, n), y=n)) +
  geom_bar(stat="identity") +
  coord_flip() +
  labs(title="Top 10 Content Producing Countries",
       x="Country", y="Count")

## Fix for the bar plot above: remove the dominant "Unknown" bar, then split multi-country entries so that each country is counted under a single name.

#  Remove NA / Unknown completely
data_clean <- data %>%
  filter(!is.na(country) & country != "Unknown")

# Split multiple countries into rows
data_clean <- data_clean %>%
  separate_rows(country, sep=",")

# Trim whitespace left over from the split
data_clean$country <- trimws(data_clean$country)

# Count titles per country after splitting
top_countries <- data_clean %>%
  count(country, sort = TRUE) %>%
  head(10)

ggplot(top_countries, aes(x=reorder(country, n), y=n)) +
  geom_bar(stat="identity") +
  coord_flip() +
  labs(title="Top 10 Content Producing Countries",
       x="Country", y="Count")

Interpretation

In this visualization, we first removed the "Unknown" bar and then merged countries that were repeated across multi-country entries. After these fixes, the USA is the top content-producing country, and Australia has the lowest count among the ten countries shown.
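The splitting step can be illustrated on a small example. This is a minimal sketch on a hypothetical three-row data frame (`toy` and its titles are made up for illustration); it assumes that multi-country entries are separated by a comma followed by optional whitespace, so the separator pattern also removes the space that `trimws()` handles above.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy rows; the real data has many more columns.
toy <- data.frame(
  title   = c("A", "B", "C"),
  country = c("United States", "United Kingdom, United States", "India"),
  stringsAsFactors = FALSE
)

# Split on a comma plus any following whitespace, then count per country.
toy_counts <- toy %>%
  separate_rows(country, sep = ",\\s*") %>%
  count(country, sort = TRUE)

toy_counts
```

Title "B" now contributes one row to the United Kingdom and one to the United States, which is why counts after splitting can exceed the number of titles.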

# Find top content type (Movie/TV Show) in each country
country_type <- data_clean %>%
  group_by(country, type) %>%
  summarise(count = n(), .groups = "drop") %>%   # drop grouping explicitly to avoid the regrouping message
  arrange(country, desc(count))
# Get top type for each country
top_country_type <- country_type %>%
  group_by(country) %>%
  slice_max(order_by = count, n = 1)

top_country_type
## # A tibble: 46 × 3
## # Groups:   country [45]
##    country     type    count
##    <chr>       <fct>   <int>
##  1 Afghanistan Movie       1
##  2 Albania     Movie       1
##  3 Argentina   TV Show     1
##  4 Australia   Movie       6
##  5 Austria     Movie       3
##  6 Belgium     Movie       5
##  7 Brazil      Movie       5
##  8 Canada      Movie      31
##  9 Chile       TV Show     1
## 10 China       Movie       6
## # ℹ 36 more rows

Interpretation

The analysis shows the dominant content type (Movie or TV Show) for each country. The United States primarily produces Movies, indicating a strong film industry presence, and India also contributes more Movies, reflecting its cinema-driven entertainment culture. Some countries show a mix, but one category usually dominates. The result has 46 rows for 45 countries because slice_max() keeps ties, so one country appears with both types. These insights help in understanding each country's content strategy, audience preference trends, and OTT platform distribution patterns.
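The tie behaviour of `slice_max()` can be seen on a small example. This is a sketch on a hypothetical data frame (`toy`, with made-up countries "X" and "Y"); `with_ties = FALSE` is one possible way to force exactly one row per country, at the cost of picking the first tied row arbitrarily.

```r
library(dplyr)

# Hypothetical counts: country X has a tie between Movie and TV Show.
toy <- data.frame(
  country = c("X", "X", "Y", "Y"),
  type    = c("Movie", "TV Show", "Movie", "TV Show"),
  count   = c(2L, 2L, 5L, 1L)
)

# Default: tied rows are all kept, so X appears twice.
with_ties_kept <- toy %>%
  group_by(country) %>%
  slice_max(order_by = count, n = 1) %>%
  ungroup()

# with_ties = FALSE returns exactly one row per group.
one_row_each <- toy %>%
  group_by(country) %>%
  slice_max(order_by = count, n = 1, with_ties = FALSE) %>%
  ungroup()

nrow(with_ties_kept)
nrow(one_row_each)
```

Keeping ties is arguably the more honest choice for reporting, since it shows that no single type dominates in that country.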

#Top Genres (Values shown)

top_genres <- data %>%
  count(listed_in, sort = TRUE) %>%
  head(10)

ggplot(top_genres, aes(x = reorder(listed_in, n), y = n, fill = listed_in)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = n), hjust = -0.1) +
  coord_flip() +
  labs(title = "Top Genres",
       x = "Genre", y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")

Interpretation

The visualization shows the 10 most frequent genre labels on the platform. A few labels dominate the dataset: Drama (986) and Comedy (536) lead, followed by combinations such as "Drama, Suspense" (399) and "Comedy, Drama" (377). Because listed_in stores combined labels, a title tagged "Comedy, Drama" is counted separately from one tagged "Drama". The higher counts show how much of the catalogue falls under these labels, which may indicate strong demand and wide audience appeal, including among Gen Z viewers. Less frequent labels indicate niche categories with comparatively lower production or demand.

#Indian Movies with Large Cast (cast_length = number of characters in the cast field, a rough proxy for cast size)
indian_movies <- data %>%
  filter(country == "India", type == "Movie") %>%
  mutate(cast_length = nchar(cast)) %>%
  arrange(desc(cast_length)) %>%
  select(title, cast_length) %>%
  head(10)

ggplot(indian_movies, aes(x = reorder(title, cast_length), y = cast_length, fill = title)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = cast_length), hjust = -0.1) +
  coord_flip() +
  labs(title = "Indian Movies with Large Cast (High Production Value)",
       x = "Movie Title", y = "Cast Size Indicator") +
  theme_minimal() +
  theme(legend.position = "none")

Interpretation

The data visualizations show that the platform is largely movie-dominated (7,814 Movies against 1,854 TV Shows), with a significant rise in content in recent years, highlighting a focus on modern and trending releases. Among titles with a known country, the USA and India contribute the most content, indicating strong industry influence, while genres such as drama, comedy, and suspense reflect common audience preferences. Indian movies with longer cast lists may suggest higher production value and broader appeal, although the indicator used here is the character length of the cast field rather than a count of cast members. Overall, these visualizations reveal patterns of content growth, audience taste, and global distribution, helping to understand how the platform aligns its content strategy with viewer interests.
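Since the cast indicator is based on `nchar()`, long names inflate it. One alternative is to count the comma-separated names instead. The following is a minimal sketch on hypothetical toy strings; it assumes that cast members are separated by commas and that a missing cast is stored as an empty string, neither of which has been checked against the full dataset.

```r
# Hypothetical toy values standing in for the `cast` column.
cast <- c("Brendan Gleeson, Taylor Kitsch, Gordon Pinsent",
          "Mahesh Manjrekar, Abhay Mahajan",
          "")

# Character length, as used for the plot above.
cast_length <- nchar(cast)

# Number of comma-separated names; an empty string counts as zero.
cast_count <- ifelse(cast == "", 0L,
                     lengths(strsplit(cast, ",", fixed = TRUE)))

data.frame(cast_length, cast_count)
```

Ranking by `cast_count` instead of `cast_length` could change which titles appear in the top 10, so the two indicators are worth comparing before drawing conclusions about production value.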

##EDA (Exploratory Data Analysis)

#Summary Statistics
summary(data)
##    show_id               type         title             director        
##  Length:9668        Movie  :7814   Length:9668        Length:9668       
##  Class :character   TV Show:1854   Class :character   Class :character  
##  Mode  :character                  Mode  :character   Mode  :character  
##                                                                         
##                                                                         
##                                                                         
##                                                                         
##      cast                                    country      date_added       
##  Length:9668        Unknown                      :8996   Length:9668       
##  Class :character   United States                : 253   Class :character  
##  Mode  :character   India                        : 229   Mode  :character  
##                     United Kingdom               :  28                     
##                     Canada                       :  16                     
##                     United Kingdom, United States:  12                     
##                     (Other)                      : 134                     
##   release_year      rating       duration          listed_in        
##  Min.   :1920   13+    :2117   Length:9668        Length:9668       
##  1st Qu.:2007   16+    :1547   Class :character   Class :character  
##  Median :2016   ALL    :1268   Mode  :character   Mode  :character  
##  Mean   :2008   18+    :1243                                        
##  3rd Qu.:2019   R      :1010                                        
##  Max.   :2021   PG-13  : 393                                        
##                 (Other):2090                                        
##  description         year_added        content_age       
##  Length:9668        Length:9668        Length:9668       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
## 
#Structure of Dataset
str(data)
## 'data.frame':    9668 obs. of  14 variables:
##  $ show_id     : chr  "s1" "s2" "s3" "s4" ...
##  $ type        : Factor w/ 2 levels "Movie","TV Show": 1 1 1 1 1 1 1 1 1 1 ...
##  $ title       : chr  "The Grand Seduction" "Take Care Good Night" "Secrets of Deception" "Pink: Staying True" ...
##  $ director    : chr  "Don McKellar" "Girish Joshi" "Josh Webber" "Sonia Anderson" ...
##  $ cast        : chr  "Brendan Gleeson, Taylor Kitsch, Gordon Pinsent" "Mahesh Manjrekar, Abhay Mahajan, Sachin Khedekar" "Tom Sizemore, Lorenzo Lamas, Robert LaSardo, Richard Jones, Yancey Arias, Noel Gugliemi" "Interviews with: Pink, Adele, Beyoncé, Britney Spears, Christina Aguilera, more!" ...
##  $ country     : Factor w/ 87 levels "Afghanistan, France",..: 9 27 61 61 52 52 61 61 9 61 ...
##  $ date_added  : chr  "March 30, 2021" "March 30, 2021" "March 30, 2021" "March 30, 2021" ...
##  $ release_year: int  2014 2018 2017 2014 1989 1989 2017 2016 2017 1994 ...
##  $ rating      : Factor w/ 25 levels "","13+","16",..: 1 2 1 1 1 1 1 1 1 1 ...
##  $ duration    : chr  "113 min" "110 min" "74 min" "69 min" ...
##  $ listed_in   : chr  "Comedy, Drama" "Drama, International" "Action, Drama, Suspense" "Documentary" ...
##  $ description : chr  "A small fishing village must procure a local doctor to secure a lucrative business contract. When unlikely cand"| __truncated__ "A Metro Family decides to fight a Cyber Criminal threatening their stability and pride." "After a man discovers his wife is cheating on him with a neighborhood kid he goes down a furious path of self-destruction" "Pink breaks the mold once again, bringing her career to a new level in 2013 with a world tour that entertains u"| __truncated__ ...
##  $ year_added  : chr  "2021" "2021" "2021" "2021" ...
##  $ content_age : chr  "Old" "New" "New" "Old" ...
#Check Unique Values (Type)
unique(data$type)
## [1] Movie   TV Show
## Levels: Movie TV Show
#Count of Ratings
data %>%
  count(rating, sort = TRUE)
##      rating    n
## 1       13+ 2117
## 2       16+ 1547
## 3       ALL 1268
## 4       18+ 1243
## 5         R 1010
## 6     PG-13  393
## 7        7+  385
## 8            337
## 9        PG  253
## 10       NR  223
## 11    TV-14  208
## 12    TV-PG  169
## 13    TV-NR  105
## 14        G   93
## 15     TV-G   81
## 16    TV-MA   77
## 17     TV-Y   74
## 18    TV-Y7   39
## 19  UNRATED   33
## 20 AGES_18_    3
## 21    NC-17    3
## 22 NOT_RATE    3
## 23 AGES_16_    2
## 24       16    1
## 25 ALL_AGES    1
#Content Added Per Year
data %>%
  mutate(year_added = as.numeric(format(as.Date(date_added, "%B %d, %Y"), "%Y"))) %>%
  count(year_added, sort = TRUE)
##   year_added    n
## 1         NA 9513
## 2       2021  155

Interpretation

The EDA shows that the dataset has 9,668 rows and 14 variables, mostly categorical, with Movies (7,814) and TV Shows (1,854) as the two content types. Release years range from 1920 to 2021 with a median of 2016, and the rating distribution shows variation in audience targeting, led by 13+ (2,117) and 16+ (1,547). The content-added-per-year count is not informative: only 155 rows yield a parseable date_added, all in 2021, so no trend over time can be drawn from it. Data quality also needs attention before further analysis, since country is "Unknown" for 8,996 rows, 337 ratings are blank, and labels such as 16 and 16+ or ALL and ALL_AGES overlap.
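A quick way to quantify these gaps is to compute the share of missing-like values per column. This is a sketch on a hypothetical toy data frame; the helper `is_missing()` is made up for illustration, and treating empty strings and "Unknown" as missing is an assumption about how this dataset encodes absent values.

```r
# Hypothetical toy rows mirroring three columns of the dataset.
toy <- data.frame(
  country    = c("Unknown", "India", "Unknown"),
  rating     = c("13+", "", "R"),
  date_added = c("", "March 30, 2021", ""),
  stringsAsFactors = FALSE
)

# Treat NA, empty strings, and "Unknown" as missing (an assumption).
is_missing <- function(x) {
  is.na(x) | trimws(as.character(x)) %in% c("", "Unknown")
}

# Share of missing-like values in each column.
missing_share <- vapply(toy, function(col) mean(is_missing(col)), numeric(1))
round(missing_share, 2)
```

Applied to the full data frame, the same call would show at a glance which columns are too sparse to support a trend or ranking.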

##CORRELATION ANALYSIS

# Which age rating goes with which genre?

# Convert rating to numeric age
data$rating_num <- recode(data$rating,
  "ALL" = 0,
  "7+" = 7,
  "13+" = 13,
  "16+" = 16,
  "18+" = 18,
  "PG" = 10,
  "PG-13" = 13,
  "R" = 17,
  "NR" = NA_real_
)

# Split genres
genre_data <- data %>%
  separate_rows(listed_in, sep=",")

genre_data$listed_in <- trimws(genre_data$listed_in)

# Top 5 genres
top_genres <- genre_data %>%
  count(listed_in, sort = TRUE) %>%
  head(5) %>%
  pull(listed_in)

# Empty result
results <- data.frame(Genre=character(), Correlation=numeric())

# Loop
for(g in top_genres){
  
  genre_data$flag <- ifelse(genre_data$listed_in == g, 1, 0)
  
  temp <- genre_data %>%
    select(rating_num, flag) %>%
    na.omit()
  
  corr <- cor(temp$rating_num, temp$flag)
  
  results <- rbind(results, data.frame(Genre=g, Correlation=corr))
}


# View results
results
##      Genre Correlation
## 1    Drama  0.09467448
## 2   Comedy  0.06140328
## 3   Action  0.08200368
## 4 Suspense  0.13443608
## 5     Kids -0.41662130
ggplot(results, aes(x=reorder(Genre, Correlation), y=Correlation)) +
  geom_bar(stat="identity") +
  coord_flip() +
  labs(title="Correlation between Age Rating and Top Genres",
       x="Genre", y="Correlation")

Interpretation

The correlation analysis between numeric age rating and genre membership shows weak positive relationships for Drama (0.09), Comedy (0.06), Action (0.08), and Suspense (0.13), indicating that these genres are only slightly more common at higher age ratings. Kids is the exception, with a moderate negative correlation (-0.42): kids content is concentrated at low age ratings. Rows whose rating has no numeric value are excluded from these correlations.
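Each value above is a Pearson correlation between a numeric rating and a 0/1 genre flag. The same computation can be written without growing a data frame inside a loop. This is a sketch on hypothetical toy data (`toy` and its values are made up), using `vapply()` and `cor(..., use = "complete.obs")` to skip rows with a missing rating.

```r
# Hypothetical toy data: one row per (title, genre) pair, as in genre_data.
toy <- data.frame(
  rating_num = c(0, 7, 7, 13, 16, 18, 18, NA),
  listed_in  = c("Kids", "Kids", "Comedy", "Drama",
                 "Suspense", "Suspense", "Drama", "Kids"),
  stringsAsFactors = FALSE
)

genres <- c("Kids", "Suspense")

# One correlation per genre between the rating and a 0/1 membership flag.
correlations <- vapply(genres, function(g) {
  flag <- as.numeric(toy$listed_in == g)
  cor(toy$rating_num, flag, use = "complete.obs")
}, numeric(1))

results_toy <- data.frame(Genre = names(correlations),
                          Correlation = unname(correlations))
results_toy
```

In this toy data the Kids flag correlates negatively with the rating and the Suspense flag positively, which is the same direction as in the results above.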

##REGRESSION ANALYSIS

#Convert Genre into Binary (Suspense)
genre_data$suspense_flag <- ifelse(genre_data$listed_in == "Suspense", 1, 0)

#Build Model
model <- glm(suspense_flag ~ rating_num, data = genre_data, family = "binomial")

#Model Summary
summary(model)
## 
## Call:
## glm(formula = suspense_flag ~ rating_num, family = "binomial", 
##     data = genre_data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.246864   0.131565  -32.28   <2e-16 ***
## rating_num   0.133097   0.008443   15.77   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9220.9  on 15656  degrees of freedom
## Residual deviance: 8849.7  on 15655  degrees of freedom
##   (2652 observations deleted due to missingness)
## AIC: 8853.7
## 
## Number of Fisher Scoring iterations: 6
#Convert Genre into Binary (Kids)
genre_data$kids_flag <- ifelse(genre_data$listed_in == "Kids", 1, 0)

model_kids <- glm(kids_flag ~ rating_num, data = genre_data, family = "binomial")

summary(model_kids)
## 
## Call:
## glm(formula = kids_flag ~ rating_num, family = "binomial", data = genre_data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.811519   0.046523  -17.44   <2e-16 ***
## rating_num  -0.250403   0.006674  -37.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6565.8  on 15656  degrees of freedom
## Residual deviance: 4511.9  on 15655  degrees of freedom
##   (2652 observations deleted due to missingness)
## AIC: 4515.9
## 
## Number of Fisher Scoring iterations: 7

Interpretation

The coefficient for age rating (0.1331) is positive, indicating that as the age rating increases, the likelihood of suspense content also increases. The p-value (< 2e-16) shows that this relationship is statistically highly significant. Additionally, the reduction in deviance from 9220.9 to 8849.7 suggests that the model provides a better fit than the null model, although the improvement is modest (about 4% of the null deviance).

The coefficient for age rating (-0.2504) is negative and relatively large, indicating that as the age rating increases, the likelihood of kids content decreases significantly. The p-value (< 2e-16) confirms that this relationship is statistically highly significant. The substantial drop in deviance from 6565.8 to 4511.9 (about 31% of the null deviance) indicates that the model has strong explanatory power and fits the data well.
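Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below copies the estimates and deviances from the two `summary()` outputs above rather than refitting the models; the variable names are made up for illustration. It gives an odds ratio of roughly 1.14 for Suspense and roughly 0.78 for Kids per one-unit increase in `rating_num`.

```r
# Estimates copied from the summary() outputs above.
coef_suspense  <- 0.133097
coef_kids      <- -0.250403
intercept_kids <- -0.811519

# Odds ratios per one-unit increase in rating_num.
or_suspense <- exp(coef_suspense)
or_kids     <- exp(coef_kids)

# Predicted probability of the Kids label at rating_num = 0 and 18.
p_kids <- plogis(intercept_kids + coef_kids * c(0, 18))

# Share of the null deviance removed by each model.
dev_share_suspense <- 1 - 8849.7 / 9220.9
dev_share_kids     <- 1 - 4511.9 / 6565.8

round(c(or_suspense = or_suspense, or_kids = or_kids), 3)
round(p_kids, 3)
round(c(suspense = dev_share_suspense, kids = dev_share_kids), 3)
```

With the fitted objects available, `exp(coef(model))` and `exp(coef(model_kids))` would return the same odds ratios directly.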

ggplot(genre_data, aes(x = rating_num, y = kids_flag)) +
  geom_jitter(height = 0.05, alpha = 0.3) +   # scatter points
  stat_smooth(method = "glm", 
              method.args = list(family = "binomial"),
              color = "blue") +
  labs(title = "Logistic Regression: Age Rating vs Kids Content",
       x = "Age Rating",
       y = "Probability of Kids Content")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2652 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2652 rows containing missing values or values outside the scale range
## (`geom_point()`).

ggplot(genre_data, aes(x = rating_num, y = suspense_flag)) +
  geom_jitter(height = 0.05, alpha = 0.3) +   # scatter points
  stat_smooth(method = "glm", 
              method.args = list(family = "binomial"),
              color = "blue") +
  labs(title = "Logistic Regression: Age Rating vs Suspense Genre",
       x = "Age Rating",
       y = "Probability of Suspense Genre")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2652 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2652 rows containing missing values or values outside the scale range
## (`geom_point()`).

Interpretation

The analysis reveals that kids content has a strong relationship with age rating. The logistic regression results and visualization indicate that content designed for younger audiences is concentrated at lower age ratings. As the age rating increases, the probability of kids content decreases significantly.

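This is consistent with younger viewers being served mainly children-oriented content, with preferences shifting toward more mature themes as the age rating increases. The rating describes the intended audience of a title, so this is an inference about viewers rather than a direct measurement of their preferences.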

Similarly, suspense content shows a positive relationship with age rating. This indicates that suspense and mature-themed content are more commonly associated with higher age ratings, reflecting the interests of older audiences.