Introduction

Executive Summary

The world of the movie business is an area where there are a variety of variables to analyze and generate success in a series or movie for the next hit that people will be talking about, but you have to take into account the exact variables of a database and give a good analysis to see what strategy should be taken into account to take action.

Objectives

At the end of this evidence, several pre-established objectives will be resolved to provide greater transparency and quality data on what variables generate the success of a film, in order to reach this result:

  • Treatment to NAs

  • Information cleaning

  • Data organization

  • Information analysis

  • Answer to the main question

Body

Data description

First of all, it is to call the libraries, tools and database that we are going to realize to give different treatments, manipulate and organize information to have a visible result.

Libraries

#Calling Libraries
library(units)
## Warning: package 'units' was built under R version 4.3.3
## udunits database from C:/Users/Carlos/AppData/Local/R/win-library/4.3/units/share/udunits/udunits2.xml
library(visdat)
## Warning: package 'visdat' was built under R version 4.3.3
library(DataExplorer)
## Warning: package 'DataExplorer' was built under R version 4.3.3
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(janitor)
## Warning: package 'janitor' was built under R version 4.3.3
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   3.5.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(stringr)
library(tidyr)
library(readr)
library(ggplot2)

Database

#Callling Database
movies <- read.csv("C:\\Users\\Carlos\\Desktop\\movies_metadata.csv")

Data cleaning conducted

Treatment of NAs

#Search how many NAs are in the database
sum(is.na(movies))
## [1] 281
# Search in each column how many NAs exist to see what type of elimination to apply
sapply(movies, function(x) sum(is.na(x)))
##                 adult belongs_to_collection                budget 
##                     0                     0                     0 
##                genres              homepage                    id 
##                     0                     0                     0 
##               imdb_id     original_language        original_title 
##                     0                     0                     0 
##              overview            popularity           poster_path 
##                     0                     0                     0 
##  production_companies  production_countries          release_date 
##                     0                     0                     0 
##               revenue               runtime      spoken_languages 
##                     6                   263                     0 
##                status               tagline                 title 
##                     0                     0                     0 
##                 video          vote_average            vote_count 
##                     0                     6                     6
# Cleaning and summary of column "runtime"
movies2 <- movies

movies2$runtime[is.na(movies2$runtime)] <- mean(movies2$runtime, na.rm=TRUE)
summary(movies2$runtime)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   85.00   95.00   94.13  107.00 1256.00
# Cleaning and summary of column "revenue"
movies3 <- movies2

movies3$revenue[is.na(movies3$revenue)] <- mean(movies3$revenue, na.rm=TRUE)
summary(movies3$revenue)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.000e+00 0.000e+00 0.000e+00 1.121e+07 0.000e+00 2.788e+09
# Cleaning and summary of column "vote count"
movies4 <- movies3

movies4$vote_count[is.na(movies4$vote_count)] <- mean(movies4$vote_count, na.rm=TRUE)
summary(movies4$vote_count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     3.0    10.0   109.9    34.0 14075.0
# Cleaning and summary of column "vote average"
movies_wo_na <- movies4

movies_wo_na$vote_average[is.na(movies_wo_na$vote_average)] <- mean(movies_wo_na$vote_average, na.rm=TRUE)
summary(movies_wo_na$vote_average)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   5.000   6.000   5.618   6.800  10.000
#Verifying NAs
sum(is.na(movies_wo_na))
## [1] 0

Reasoning

The method of elimination used for the NAs was by taking the average, which was decided for two reasons.
- Because in the analysis of the columns it was possible to obtain that the values that contained NA were numeric and integers runtime, vote average, vote count, revenue.
- The missing values are important and it is desired to preserve the general structure of the data.

Treatment of duplicates and partially duplicates

# Check how many information are totally duplicated
sum(duplicated(movies_wo_na)) 
## [1] 17
# Eliminate information that is totally duplicated
movies_wo_dup <- movies_wo_na
movies_wo_dup <- distinct(movies_wo_dup)
# Check if duplicate information is eliminated
sum(duplicated(movies_wo_dup))
## [1] 0
# Check information that could be partially duplicated
movies_wo_dup %>% 
  count(id) %>%
  filter(n > 1)
##        id n
## 1   10991 2
## 2  109962 2
## 3  110428 2
## 4   12600 2
## 5   13209 2
## 6  132641 2
## 7   14788 2
## 8   15028 2
## 9   22649 2
## 10   4912 2
## 11  69234 2
## 12  77221 2
## 13  84198 2
# Check the rows for specific duplicates

id_values <- c("10991")

selected_rows <- movies_wo_dup[movies_wo_dup$id %in% id_values, ]

selected_rows # Differ in popularity
##       adult
## 4114  False
## 44806 False
##                                                                                                                                     belongs_to_collection
## 4114  {'id': 34055, 'name': 'Pokémon Collection', 'poster_path': '/j5te0YNZAMXDBnsqTUDKIBEt8iu.jpg', 'backdrop_path': '/iGoYKA0TFfgSoZpG2u5viTJMGfK.jpg'}
## 44806 {'id': 34055, 'name': 'Pokémon Collection', 'poster_path': '/j5te0YNZAMXDBnsqTUDKIBEt8iu.jpg', 'backdrop_path': '/iGoYKA0TFfgSoZpG2u5viTJMGfK.jpg'}
##         budget
## 4114  16000000
## 44806 16000000
##                                                                                                                                                                 genres
## 4114  [{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 16, 'name': 'Animation'}, {'id': 28, 'name': 'Action'}, {'id': 10751, 'name': 'Family'}]
## 44806 [{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 16, 'name': 'Animation'}, {'id': 28, 'name': 'Action'}, {'id': 10751, 'name': 'Family'}]
##                                homepage    id   imdb_id original_language
## 4114  http://movies.warnerbros.com/pk3/ 10991 tt0235679                ja
## 44806 http://movies.warnerbros.com/pk3/ 10991 tt0235679                ja
##             original_title
## 4114  Pokémon 3: The Movie
## 44806 Pokémon 3: The Movie
##                                                                                                                                                                                                                                                                                                                                                         overview
## 4114  When Molly Hale's sadness of her father's disappearance get to her, she unknowingly uses the Unown to create her own dream world along with Entei, who she believes to be her father. When Entei kidnaps Ash's mom, Ash along with Misty & Brock invade the mansion looking for his mom and trying to stop the mysteries of Molly's Dream World and Entei!
## 44806 When Molly Hale's sadness of her father's disappearance get to her, she unknowingly uses the Unown to create her own dream world along with Entei, who she believes to be her father. When Entei kidnaps Ash's mom, Ash along with Misty & Brock invade the mansion looking for his mom and trying to stop the mysteries of Molly's Dream World and Entei!
##       popularity                      poster_path
## 4114   10.264597 /5ILjS6XB5deiHop8SXPsYxXWVPE.jpg
## 44806   6.480376 /5ILjS6XB5deiHop8SXPsYxXWVPE.jpg
##                                                                                                                                                                                                                                                                                                                               production_companies
## 4114  [{'name': 'TV Tokyo', 'id': 3034}, {'name': '4 Kids Entertainment', 'id': 3035}, {'name': 'Nintendo', 'id': 12288}, {'name': 'Pikachu Project', 'id': 12653}, {'name': 'Oriental Light and Magic (OLM)', 'id': 66289}, {'name': "Kids' WB", 'id': 89163}, {'name': 'Pikachu Project 2001', 'id': 89164}, {'name': 'Shoakuken', 'id': 89165}]
## 44806 [{'name': 'TV Tokyo', 'id': 3034}, {'name': '4 Kids Entertainment', 'id': 3035}, {'name': 'Nintendo', 'id': 12288}, {'name': 'Pikachu Project', 'id': 12653}, {'name': 'Oriental Light and Magic (OLM)', 'id': 66289}, {'name': "Kids' WB", 'id': 89163}, {'name': 'Pikachu Project 2001', 'id': 89164}, {'name': 'Shoakuken', 'id': 89165}]
##                          production_countries release_date  revenue runtime
## 4114  [{'iso_3166_1': 'JP', 'name': 'Japan'}]   2000-07-08 68411275      93
## 44806 [{'iso_3166_1': 'JP', 'name': 'Japan'}]   2000-07-08 68411275      93
##                               spoken_languages   status
## 4114  [{'iso_639_1': 'en', 'name': 'English'}] Released
## 44806 [{'iso_639_1': 'en', 'name': 'English'}] Released
##                             tagline                         title video
## 4114  Pokémon: Spell of the Unknown Pokémon: Spell of the Unknown False
## 44806 Pokémon: Spell of the Unknown Pokémon: Spell of the Unknown False
##       vote_average vote_count
## 4114             6        143
## 44806            6        144
# Let's see with another parcial duplicate

id_values <- c("109962")

selected_rows <- movies_wo_dup[movies_wo_dup$id %in% id_values, ]

selected_rows # Also differs in popularity
##       adult belongs_to_collection budget                        genres homepage
## 5710  False                            0 [{'id': 18, 'name': 'Drama'}]         
## 20895 False                            0 [{'id': 18, 'name': 'Drama'}]         
##           id   imdb_id original_language  original_title
## 5710  109962 tt0082992                en Rich and Famous
## 20895 109962 tt0082992                en Rich and Famous
##                                                                                               overview
## 5710  Two literary women compete for 20 years: one writes for the critics; the other one, to get rich.
## 20895 Two literary women compete for 20 years: one writes for the critics; the other one, to get rich.
##       popularity                      poster_path
## 5710   12.180836 /tOflyY8eUFWubLKJH7fKg4KwpCl.jpg
## 20895  10.396878 /tOflyY8eUFWubLKJH7fKg4KwpCl.jpg
##                                                                       production_companies
## 5710  [{'name': 'Metro-Goldwyn-Mayer (MGM)', 'id': 8411}, {'name': 'Jaquet', 'id': 16778}]
## 20895 [{'name': 'Metro-Goldwyn-Mayer (MGM)', 'id': 8411}, {'name': 'Jaquet', 'id': 16778}]
##                                             production_countries release_date
## 5710  [{'iso_3166_1': 'US', 'name': 'United States of America'}]   1981-09-23
## 20895 [{'iso_3166_1': 'US', 'name': 'United States of America'}]   1981-09-23
##       revenue runtime                         spoken_languages   status
## 5710        0     115 [{'iso_639_1': 'en', 'name': 'English'}] Released
## 20895       0     115 [{'iso_639_1': 'en', 'name': 'English'}] Released
##                                                                                                                     tagline
## 5710  From the very beginning, they knew they'd be friends to the end. What they didn't count on was everything in between.
## 20895 From the very beginning, they knew they'd be friends to the end. What they didn't count on was everything in between.
##                 title video vote_average vote_count
## 5710  Rich and Famous False          4.9          7
## 20895 Rich and Famous False          4.9          7
# Apply a condition where only those with the lower popularity stay

movies_wo_dup <- movies_wo_dup %>% group_by(id) %>% filter(popularity == min(popularity))

movies_wo_dup %>% 
  count(id) %>%
  filter(n > 1) # Verification for no partial duplicates
## # A tibble: 0 × 2
## # Groups:   id [0]
## # ℹ 2 variables: id <chr>, n <int>

- Reasoning -

When collecting information there may be missing data such as NAs, but there is also information that is repeated either partially or totally, for each situation it is important to give a different treatment, in the case of the totally duplicated data we first checked the database that had already given treatment to the NAs, then we used the function distinct that its mechanics is the elimination of rows in which it sees that there is duplicated information and at the end we verified that the duplicated rows have been eliminated.

In the case of the partially duplicated rows, something similar was done but with the exception of using the count function on id to see how many of these are duplicated because there cannot be two of the same id and they differ in popularity, so the filter function was used to eliminate the partially duplicated rows.

Organization of information

Genres

# Splitting each string into a character vector, returns a df

# Use a regular expression to match the genre names
genre_pattern <- "(?<=name': ')[^']*(?=')"

# Apply the extraction to the entire column and store the result in a new object
genre_data <- lapply(movies_wo_dup$genres, function(x) {
  genres <- str_extract_all(x, genre_pattern)[[1]]
  split_genres <- strsplit(genres, ", ")
  unlist(split_genres)
})
# Calculate the maximum number of genres a movie can have
max_genre <- max(sapply(genre_data, length))

# Convert the list to a data frame
genre_df <- do.call(rbind, lapply(genre_data, function(x) {
  tmp <- rep(NA, max_genre)
  tmp[seq_along(x)] <- x
  as.data.frame(t(tmp))
}))

# Rename the columns
names(genre_df) <- paste0("genre", seq_len(ncol(genre_df)))
# Join the new genre data frame with the original data frame
movies_wo_dup <- cbind(movies_wo_dup, genre_df)

# Delete the genres column
movies_w_genres <- subset(movies_wo_dup, select = -genres)

Production Companies

# Feature to clear production company names
clean_production_companies <- function(companies) {
  # Apply regular expression to extract company name
  cleaned_names <- str_extract(companies, "'name': '([^']+)'")
  # Delete the extra part
  cleaned_names <- gsub("'name': '", "", cleaned_names)
  cleaned_names <- gsub("'", "", cleaned_names)
  return(cleaned_names)
}

# Apply the function to the production_companies column and store the results in a new column
movies_w_genres$Clean_production_companies <- clean_production_companies(movies_w_genres$production_companies)

# Delete the production companies column
movies_w_prodcom <- subset(movies_w_genres, select = -production_companies)

Production Countries

# Function to clear the names of the countries of production
clean_production_countries <- function(countries) {
  # Apply regular expression to extract country name
  cleaned_names <- str_extract(countries, "'name': '([^']+)'")
  # Delete the extra part
  cleaned_names <- gsub("'name': '", "", cleaned_names)
  cleaned_names <- gsub("'", "", cleaned_names)
  return(cleaned_names)
}

# Apply the function to the production_countries column and store the results in a new column
movies_w_prodcom$Clean_production_countries <- clean_production_countries(movies_w_prodcom$production_countries)

# Delete the production companies column
movies_w_prodcoun <- subset(movies_w_prodcom, select = -production_countries)

Spoken Languages

# Use a regular expression to match the language names
language_pattern <- "(?<=name': ')[^']*(?=')"

# Apply the extraction to the entire column and store the result in a new object
language_data <- lapply(movies_w_prodcoun$spoken_languages, function(x) {
  languages <- str_extract_all(x, language_pattern)[[1]]
  split_languages <- strsplit(languages, ", ")
  unlist(split_languages)
})
# Calculate the maximum number of languages a movie can have
max_language <- max(sapply(language_data, length))

# Convert the list to a data frame
language_df <- do.call(rbind, lapply(language_data, function(x) {
  tmp <- rep(NA, max_language)
  tmp[seq_along(x)] <- x
  as.data.frame(t(tmp))
}))

# Rename the columns
names(language_df) <- paste("spoken_lang", seq_len(ncol(language_df)), sep = "_")
# Join the new language data frame with the original data frame
movies_w_lang <- cbind(movies_w_prodcoun, language_df)

# Eliminate original column
movies_w_lang$spoken_languages <- NULL

Belongs to collection

# Function to separate the 'belongs_to_collection' column
separate_belongs_to_collection <- function(movies_w_lang) {
  # Create a new DataFrame to perform separation
  collection_data <- movies_w_lang

  # Apply regular expression to extract collection name
  collection_names <- str_extract(collection_data$belongs_to_collection, "'name': '([^']+)'")

  # Delete the extra part
  collection_names <- gsub("'name': '", "", collection_names)
  collection_names <- gsub("'", "", collection_names)

  # Create a new column in the new DataFrame with the extracted collection names
  collection_data$Collection <- collection_names

  # Remove the original 'belongs_to_collection' column from the new DataFrame
  collection_data$belongs_to_collection <- NULL

  return(collection_data)
}

# Create a new DataFrame for separation
collection_data <- movies_w_lang

# Apply the function to the new DataFrame
collection_data <- separate_belongs_to_collection(collection_data)

- Reasoning -

Several actions were carried out to better organize the information we are dealing with because we have to separate, correct and place text information for the dataframe so that it makes sense and so that functions or tables of answers can be applied to give greater clarity, several dataframes were created with new columns in which the language , generos , belong to collection , clean production companies and clean production countries were treated and the new information was put together in a new dataframe called collection_data.

Identifying out of range data

Out of value runtime

summary(collection_data$runtime)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   85.00   95.00   94.12  107.00 1256.00

Out of value revenue

summary(collection_data$revenue)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.000e+00 0.000e+00 0.000e+00 1.121e+07 0.000e+00 2.788e+09

Out of value vote count

summary(collection_data$vote_count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     3.0    10.0   109.9    34.0 14075.0

Out of value vote average

# Vote Average histogram
histogram_vote_average <- ggplot(collection_data, aes(x = vote_average)) +
  geom_histogram(fill = "black", color = "red") +
  labs(title = "Histogram of Vote Average")

# Display histogram 
print(histogram_vote_average)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Out of value Budget

collection_data$budget <- as.integer(collection_data$budget)
## Warning: NAs introducidos por coerción
# Create the histogram
histogram_budget <- ggplot(collection_data, aes(x = budget)) +
  geom_histogram(fill = "black", color = "blue") +
  labs(title = "Histogram of Budget")

# Display the histogram
print(histogram_budget)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_bin()`).

### - Reasoning -

We took into account the analysis of the out of range values of collection_data to know if a treatment of such data is needed to see if it coincides with its own nature and that in turn makes sense such information.

Which in this case should only have the visibility of them but not the elimination because this may alter the data set and that in turn are intiger class data which in turn makes sense that certain values can be radically different from the others.

Converting information

Factors

# Identify categorical variables and convert them to factors
cols_to_factor <- c("adult", paste0("genre", 1:8), "Clean_production_companies", 
                    "Clean_production_countries", paste0("spoken_lang_", 1:18))

# # Convert columns to factors
collection_data[cols_to_factor] <- lapply(collection_data[cols_to_factor], factor)

Intiger & Numeric

#Verfying columns that has numbers that matches with their category
class(collection_data$budget)
## [1] "integer"
class(collection_data$id)
## [1] "character"
class(collection_data$popularity)
## [1] "character"
class(collection_data$revenue)
## [1] "numeric"
class(collection_data$vote_average)
## [1] "numeric"
class(collection_data$vote_count)
## [1] "numeric"
#Treatment for popularity
collection_data$popularity <- as.numeric(collection_data$popularity)
## Warning: NAs introducidos por coerción
# Round popularity to whole numbers
collection_data$round_popularity <- round(collection_data$popularity)

# Show results
class(collection_data$popularity)
## [1] "numeric"

- Reasoning -

It was verified that from collection_data the columns had their correct class and those that did not were given a treatment to manipulate the information so that it was correct.

Applying tool Data explorer

report_data <- collection_data[-c(22:27, 32:47)]
#create_report(collection_data, y = "revenue")
#create_report(report_data, y = "revenue")

- Reasoning -

After visualizing the information, an alarming detail was the excessive amount of NAs, which was 49%, so it is not possible to give an exact interpretation of what we are looking for, so we will eliminate the columns that have enough lack of information and once that is done we can give more accurate conclusions.

Deleting columns of genre

collection_data <- subset(collection_data, select = -c(22:27))

Deleting columns of speak lang

collection_data <- subset(collection_data, select = -c(26:41))

Verfying NAs

# Instalar la biblioteca
#install.packages("heatmaply")

# Cargar la biblioteca
#library(heatmaply)

# Visualizar los datos faltantes
#heatmaply_na(collection_data, showticklabels = c(TRUE, FALSE))

Hi teacher, I just wanna to comment you that in this part of the evidence I try to use a different library as is mention in the evidence but I dont why it runs the library and function but it takes to much time, but what I try to do is to replace the library of visdat to see the the missing information with a different table I saw it on google and I like it.

Conclusion

Results

Bar Chart

# Create a backup df
copy_data <- collection_data
# Calculate the frequencies of each production company
company_frequencies <- copy_data %>%
  count(Clean_production_companies) %>%
  arrange(desc(n))  # Sort in descending order of frequency

# Select the top 10 production companies
top_10_companies <- head(company_frequencies, 10)

top_10_companies
## # A tibble: 10 × 3
## # Groups:   id [10]
##    id     Clean_production_companies                        n
##    <chr>  <fct>                                         <int>
##  1 100    Handmade Films Ltd.                               1
##  2 10000  Ministère de la Culture et de la Francophonie     1
##  3 10001  Warner Bros.                                      1
##  4 100010 Metro-Goldwyn-Mayer (MGM)                         1
##  5 100017 MMM Film Zimmermann & Co                          1
##  6 10002  The Criterion Collection                          1
##  7 100024 <NA>                                              1
##  8 10003  Paramount Pictures                                1
##  9 100032 Von Zerneck Sertner Films                         1
## 10 100033 Shochiku Ofuna                                    1
# Calculate the count of each company
company_counts <- table(copy_data$Clean_production_companies)

# Sort the companies by count in descending order and select the top five
top_companies <- names(sort(company_counts, decreasing = TRUE)[1:8])

# Filter the dataset to include only the top five most important companies
filtered_data <- copy_data[collection_data$Clean_production_companies %in% top_companies, ]

# Create the bar plot
ggplot(filtered_data, aes(x = Clean_production_companies)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Density Plot

# Calculate the count of each country
country_counts <- table(collection_data$Clean_production_countries)

# Sort the countries by count in descending order and select the top five
top_countries <- names(sort(country_counts, decreasing = TRUE)[1:8])

# Filter the dataset to include only the top five most important countries
filtered_data <- collection_data[collection_data$Clean_production_countries %in% top_countries, ]

# Create the density plot
ggplot(filtered_data, aes(x = Clean_production_countries)) +
  geom_density() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Scatter Graph

ggplot(data = collection_data, aes(x = collection_data$genre2, y = collection_data$popularity)) +
  geom_point() +
  labs(title = "Relation beteween genre2 and popularity", x = "genre2", y = "popularity") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_point()`).

Interpretation of the results

By conducting a long research of information different types of graphs were used in the conclusion to give a visualization of the data and the frequency in which it is used, what this can tell us that using the variable that has the highest frequency what this would guarantee that it will be successful film.

The first table the most successful company is Paramount Pictures, this may be due to its good management, logistics or development of films and this attracts the public reaching the value 1000

In the density plot shows us that Canada is the country that has the highest Density in terms of film viewing which means that we must pay attention to what the public wants from Canada, of course we also take into account the other countries, but we must take the most for Canada because its value is 1.8.

Finally, the genre that has the greatest recognition and success is animation, which has a value of more than 500

This leads us as a conclusion to take these variables and merge them to have a successful result.

Reference

The movies dataset. (2017, 10 noviembre). Kaggle. https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset

---
title: "Evidence"
author: "Dario Gallardo"
date: "2024-05-03"
output:
  html_document:
    toc: true
    toc_float: true
    code_download: true
    theme: lumen
---

# <span style = "color: purple;"> Introduction </span>

![](C:\\Users\\Carlos\\Downloads\\film-102681_1280.jpg)

## <span style = "color: green;"> Executive Summary </span>

The world of the movie business is an area where there are a variety of variables to analyze and generate success in a series or movie for the next hit that people will be talking about, but you have to take into account the exact variables of a *database* and give a good analysis to see what strategy should be taken into account to take action.


## <span style = "color: green;"> Objectives </span>

At the end of this evidence, several pre-established objectives will be resolved to provide greater transparency and quality data on what variables generate the success of a film, in order to reach this result:

- Treatment to NAs

- Information cleaning

- Data organization

- Information analysis

- Answer to the main question


# <span style = "color: purple;"> Body </span>

## <span style = "color: green;"> Data description </span>

First of all, it is to call the libraries, tools and database that we are going to realize to give different **treatments**, **manipulate** and **organize** information to have a visible result.

### <span style = "color: black;"> Libraries </span>

```{r}
#Calling Libraries
library(units)
library(visdat)
library(DataExplorer)
library(dplyr)
library(janitor)
library(tidyverse)
library(stringr)
library(tidyr)
library(readr)
library(ggplot2)
```
### <span style = "color: black;"> Database </span>
```{r}
#Callling Database
movies <- read.csv("C:\\Users\\Carlos\\Desktop\\movies_metadata.csv")
```

## <span style = "color: green;"> Data cleaning conducted </span>

### <span style = "color: black;"> Treatment of NAs </span>
```{r}
#Search how many NAs are in the database
sum(is.na(movies))
```
```{r}
# Search in each column how many NAs exist to see what type of elimination to apply
sapply(movies, function(x) sum(is.na(x)))
```

```{r}
# Cleaning and summary of column "runtime"
movies2 <- movies

movies2$runtime[is.na(movies2$runtime)] <- mean(movies2$runtime, na.rm=TRUE)
summary(movies2$runtime)
```

```{r}
# Cleaning and summary of column "revenue"
movies3 <- movies2

movies3$revenue[is.na(movies3$revenue)] <- mean(movies3$revenue, na.rm=TRUE)
summary(movies3$revenue)
```

```{r}
# Cleaning and summary of column "vote count"
movies4 <- movies3

movies4$vote_count[is.na(movies4$vote_count)] <- mean(movies4$vote_count, na.rm=TRUE)
summary(movies4$vote_count)
```

```{r}
# Cleaning and summary of column "vote average"
movies_wo_na <- movies4

movies_wo_na$vote_average[is.na(movies_wo_na$vote_average)] <- mean(movies_wo_na$vote_average, na.rm=TRUE)
summary(movies_wo_na$vote_average)
```
```{r}
#Verifying NAs
sum(is.na(movies_wo_na))
```
### <span style = "color: black;"> Reasoning </span>
The method of elimination used for the NAs was by **taking the average**, which was decided for two reasons.  
- Because in the analysis of the columns it was possible to obtain that the values that contained NA were numeric and integers *runtime*, *vote average*, *vote count*, *revenue*.  
- The missing values are important and it is desired to preserve the general structure of the data.  

### <span style = "color: black;"> Treatment of duplicates and partially duplicates </span>
```{r}
# Check how many information are totally duplicated
sum(duplicated(movies_wo_na)) 
```

```{r}
# Eliminate information that is totally duplicated
movies_wo_dup <- movies_wo_na
movies_wo_dup <- distinct(movies_wo_dup)
```

```{r}
# Check if duplicate information is eliminated
sum(duplicated(movies_wo_dup))
```


```{r}
# Check information that could be partially duplicated
movies_wo_dup %>% 
  count(id) %>%
  filter(n > 1)
```
```{r}
# Check the rows for specific duplicates

id_values <- c("10991")

selected_rows <- movies_wo_dup[movies_wo_dup$id %in% id_values, ]

selected_rows # Differ in popularity
```
```{r}
# Let's see with another parcial duplicate

id_values <- c("109962")

selected_rows <- movies_wo_dup[movies_wo_dup$id %in% id_values, ]

selected_rows # Also differs in popularity
```

```{r}
# Apply a condition where only those with the lower popularity stay

movies_wo_dup <- movies_wo_dup %>% group_by(id) %>% filter(popularity == min(popularity))

movies_wo_dup %>% 
  count(id) %>%
  filter(n > 1) # Verification for no partial duplicates

```
### <span style = "color: black;"> - Reasoning - </span>
When collecting information there may be missing data such as NAs, but there is also information that is repeated either partially or totally, for each situation it is important to give a different treatment, in the case of the totally duplicated data we first checked the database that had already given treatment to the NAs, then we used the function **distinct** that its mechanics is the elimination of rows in which it sees that there is duplicated information and at the end we verified that the duplicated rows have been eliminated. 

In the case of the partially duplicated rows, something similar was done but with the exception of using the **count** function on *id* to see how many of these are duplicated because there cannot be two of the same *id* and they differ in popularity, so the **filter** function was used to eliminate the partially duplicated rows.

## <span style = "color: green;"> Organization of information </span>

### <span style = "color: black;">Genres</span>
```{r}
# Splitting each string into a character vector, returns a df

# Use a regular expression to match the genre names
genre_pattern <- "(?<=name': ')[^']*(?=')"

# Apply the extraction to the entire column and store the result in a new object
genre_data <- lapply(movies_wo_dup$genres, function(x) {
  genres <- str_extract_all(x, genre_pattern)[[1]]
  split_genres <- strsplit(genres, ", ")
  unlist(split_genres)
})
```

```{r}
# Calculate the maximum number of genres a movie can have
max_genre <- max(sapply(genre_data, length))

# Convert the list to a data frame
genre_df <- do.call(rbind, lapply(genre_data, function(x) {
  tmp <- rep(NA, max_genre)
  tmp[seq_along(x)] <- x
  as.data.frame(t(tmp))
}))

# Rename the columns
names(genre_df) <- paste0("genre", seq_len(ncol(genre_df)))
```

```{r}
# Join the new genre data frame with the original data frame
movies_wo_dup <- cbind(movies_wo_dup, genre_df)

# Delete the genres column
movies_w_genres <- subset(movies_wo_dup, select = -genres)
```

### <span style = "color: black;">Production Companies</span>
```{r}
# Feature to clear production company names
clean_production_companies <- function(companies) {
  # Apply regular expression to extract company name
  cleaned_names <- str_extract(companies, "'name': '([^']+)'")
  # Delete the extra part
  cleaned_names <- gsub("'name': '", "", cleaned_names)
  cleaned_names <- gsub("'", "", cleaned_names)
  return(cleaned_names)
}

# Apply the function to the production_companies column and store the results in a new column
movies_w_genres$Clean_production_companies <- clean_production_companies(movies_w_genres$production_companies)

# Delete the production companies column
movies_w_prodcom <- subset(movies_w_genres, select = -production_companies)
```

### <span style = "color: black;">Production Countries</span>
```{r}
# Function to clear the names of the countries of production
clean_production_countries <- function(countries) {
  # Apply regular expression to extract country name
  cleaned_names <- str_extract(countries, "'name': '([^']+)'")
  # Delete the extra part
  cleaned_names <- gsub("'name': '", "", cleaned_names)
  cleaned_names <- gsub("'", "", cleaned_names)
  return(cleaned_names)
}

# Apply the function to the production_countries column and store the results in a new column
movies_w_prodcom$Clean_production_countries <- clean_production_countries(movies_w_prodcom$production_countries)

# Delete the production companies column
movies_w_prodcoun <- subset(movies_w_prodcom, select = -production_countries)
```

### <span style = "color: black;">Spoken Languages</span>
```{r}
# Use a regular expression to match the language names
language_pattern <- "(?<=name': ')[^']*(?=')"

# Apply the extraction to the entire column and store the result in a new object
language_data <- lapply(movies_w_prodcoun$spoken_languages, function(x) {
  languages <- str_extract_all(x, language_pattern)[[1]]
  split_languages <- strsplit(languages, ", ")
  unlist(split_languages)
})
```

```{r}
# Calculate the maximum number of languages a movie can have
max_language <- max(sapply(language_data, length))

# Convert the list to a data frame
language_df <- do.call(rbind, lapply(language_data, function(x) {
  tmp <- rep(NA, max_language)
  tmp[seq_along(x)] <- x
  as.data.frame(t(tmp))
}))

# Rename the columns
names(language_df) <- paste("spoken_lang", seq_len(ncol(language_df)), sep = "_")
```

```{r}
# Join the new language data frame with the original data frame
movies_w_lang <- cbind(movies_w_prodcoun, language_df)

# Eliminate original column
movies_w_lang$spoken_languages <- NULL
```

### <span style = "color: black;">Belongs to collection</span>
```{r}
# Function to separate the 'belongs_to_collection' column
separate_belongs_to_collection <- function(movies_w_lang) {
  # Create a new DataFrame to perform separation
  collection_data <- movies_w_lang

  # Apply regular expression to extract collection name
  collection_names <- str_extract(collection_data$belongs_to_collection, "'name': '([^']+)'")

  # Delete the extra part
  collection_names <- gsub("'name': '", "", collection_names)
  collection_names <- gsub("'", "", collection_names)

  # Create a new column in the new DataFrame with the extracted collection names
  collection_data$Collection <- collection_names

  # Remove the original 'belongs_to_collection' column from the new DataFrame
  collection_data$belongs_to_collection <- NULL

  return(collection_data)
}

# Create a new DataFrame for separation
collection_data <- movies_w_lang

# Apply the function to the new DataFrame
collection_data <- separate_belongs_to_collection(collection_data)
```
### <span style = "color: black;"> - Reasoning - </span>

Several actions were carried out to better organize the information we are dealing with because we have to **separate**, **correct** and **place** text information for the **dataframe** so that it makes sense and so that functions or tables of answers can be applied to give greater clarity, several **dataframes** were created with new columns in which the *language* , *generos* , *belong to collection* , *clean production companies* and *clean production countries* were treated and the new information was put together in a new dataframe called **collection_data**.

## <span style = "color: green;"> Identifying out of range data </span>

### <span style = "color: black;"> Out of value runtime </span>
```{r}
summary(collection_data$runtime)
```

### <span style = "color: black;"> Out of value revenue </span>
```{r}
summary(collection_data$revenue)
```
### <span style = "color: black;"> Out of value vote count </span>
```{r}
summary(collection_data$vote_count)
```
### <span style = "color: black;"> Out of value vote average </span>
```{r}
# Vote Average histogram
histogram_vote_average <- ggplot(collection_data, aes(x = vote_average)) +
  geom_histogram(fill = "black", color = "red") +
  labs(title = "Histogram of Vote Average")

# Display histogram 
print(histogram_vote_average)
```

### <span style = "color: black;"> Out of value Budget </span>
```{r}
collection_data$budget <- as.integer(collection_data$budget)

# Create the histogram
histogram_budget <- ggplot(collection_data, aes(x = budget)) +
  geom_histogram(fill = "black", color = "blue") +
  labs(title = "Histogram of Budget")

# Display the histogram
print(histogram_budget)
```
### <span style = "color: black;"> - Reasoning - </span>

We took into account the analysis of the out of range values of **collection_data** to know if a treatment of such data is needed to see if it coincides with its own nature and that in turn makes sense such information.

Which in this case should only have the visibility of them but not the elimination because this may alter the data set and that in turn are *intiger* class data which in turn makes sense that certain values can be radically different from the others.

## <span style = "color: green;"> Converting information </span>

### <span style = "color: black;"> Factors </span>

```{r}
# Identify categorical variables and convert them to factors
cols_to_factor <- c("adult", paste0("genre", 1:8), "Clean_production_companies", 
                    "Clean_production_countries", paste0("spoken_lang_", 1:18))

# # Convert columns to factors
collection_data[cols_to_factor] <- lapply(collection_data[cols_to_factor], factor)
```

### <span style = "color: black;"> Intiger & Numeric </span>
```{r}
#Verfying columns that has numbers that matches with their category
class(collection_data$budget)
class(collection_data$id)
class(collection_data$popularity)
class(collection_data$revenue)
class(collection_data$vote_average)
class(collection_data$vote_count)
```
```{r}
#Treatment for popularity
collection_data$popularity <- as.numeric(collection_data$popularity)

# Round popularity to whole numbers
collection_data$round_popularity <- round(collection_data$popularity)

# Show results
class(collection_data$popularity)
```
### <span style = "color: black;"> - Reasoning - </span>

It was verified that from **collection_data** the columns had their correct *class* and those that did not were given a treatment to manipulate the information so that it was correct.

## <span style = "color: green;">Applying tool Data explorer </span>

```{r}
report_data <- collection_data[-c(22:27, 32:47)]
#create_report(collection_data, y = "revenue")
#create_report(report_data, y = "revenue")
```
### <span style = "color: black;"> - Reasoning - </span>

After visualizing the information, an alarming detail was the excessive amount of NAs, which was *49%*, so it is not possible to give an exact interpretation of what we are looking for, so we will eliminate the columns that have enough lack of information and once that is done we can give more accurate conclusions.

### <span style = "color: black;"> Deleting columns of genre  </span>
```{r}
collection_data <- subset(collection_data, select = -c(22:27))
```

### <span style = "color: black;">  Deleting columns of speak lang  </span>
```{r}
collection_data <- subset(collection_data, select = -c(26:41))
```

### <span style = "color: black;"> Verfying NAs  </span>
```{r}
# Instalar la biblioteca
#install.packages("heatmaply")

# Cargar la biblioteca
#library(heatmaply)

# Visualizar los datos faltantes
#heatmaply_na(collection_data, showticklabels = c(TRUE, FALSE))
```
Hi teacher, I just wanna to comment you that in this part of the evidence I try to use a different library as is mention in the evidence but I dont why it runs the library and function but it takes to much time, but what I try to do is to replace the library of visdat to see the the missing information with a different table I saw it on google and I like it. 

# <span style = "color: purple;"> Conclusion </span>

## <span style = "color: green;"> Results </span>

### <span style = "color: black;"> Bar Chart  </span>

```{r}
# Create a backup df
copy_data <- collection_data
```

```{r}
# Calculate the frequencies of each production company
company_frequencies <- copy_data %>%
  count(Clean_production_companies) %>%
  arrange(desc(n))  # Sort in descending order of frequency

# Select the top 10 production companies
top_10_companies <- head(company_frequencies, 10)

top_10_companies
```

```{r}
# Calculate the count of each company
company_counts <- table(copy_data$Clean_production_companies)

# Sort the companies by count in descending order and select the top five
top_companies <- names(sort(company_counts, decreasing = TRUE)[1:8])

# Filter the dataset to include only the top five most important companies
filtered_data <- copy_data[collection_data$Clean_production_companies %in% top_companies, ]

# Create the bar plot
ggplot(filtered_data, aes(x = Clean_production_companies)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
```

### <span style = "color: black;"> Density Plot  </span>
```{r}
# Calculate the count of each country
country_counts <- table(collection_data$Clean_production_countries)

# Sort the countries by count in descending order and select the top five
top_countries <- names(sort(country_counts, decreasing = TRUE)[1:8])

# Filter the dataset to include only the top five most important countries
filtered_data <- collection_data[collection_data$Clean_production_countries %in% top_countries, ]

# Create the density plot
ggplot(filtered_data, aes(x = Clean_production_countries)) +
  geom_density() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
```

### <span style = "color: black;"> Scatter Graph  </span>
```{r}
ggplot(data = collection_data, aes(x = collection_data$genre2, y = collection_data$popularity)) +
  geom_point() +
  labs(title = "Relation beteween genre2 and popularity", x = "genre2", y = "popularity") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
```

### <span style = "color: black;"> Interpretation of the results  </span>

By conducting a long research of information different types of graphs were used in the conclusion to give a visualization of the data and the frequency in which it is used, what this can tell us that using the variable that has the highest frequency what this would guarantee that it will be successful film.

The first table the most successful company is Paramount Pictures, this may be due to its good management, logistics or development of films and this attracts the public reaching the value 1000

In the density plot shows us that Canada is the country that has the highest Density in terms of film viewing which means that we must pay attention to what the public wants from Canada, of course we also take into account the other countries, but we must take the most for Canada because its value is 1.8.

Finally, the genre that has the greatest recognition and success is animation, which has a value of more than 500

This leads us as a conclusion to take these variables and merge them to have a successful result.

### <span style = "color: black;"> Reference </span>

The movies dataset. (2017, 10 noviembre). Kaggle. https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset







