Executive summary

The movies dataset from Kaggle provides a long list of movies from several different countries. Besides the movie titles, the dataset includes the runtime, original language, status, tagline, and other fields, along with our two most important variables: revenue and budget. We consider these critical because they let us create (mutate) a new column called profit. With profit defined, we can begin to analyze our problem setup question:

What makes a movie successful or unsuccessful?

We analyzed the data using statistical tools such as Pearson’s correlation coefficient (r) to compare various variables with profit.
1. Words in the title - Profit, r = 0.03, “Extremely low/null”
2. Budget - Profit, r = 0.48, “Moderate”
3. Revenue - Profit, r = 0.97, “Very strong”
4. Popularity - Profit, r = NA (could not be computed because popularity contains a missing value)
5. Run time (< 250 minutes) - Profit, r = 0.04, “Extremely low/null”

The years with the highest mean profit:
1. 1902
2. 2017
3. 1905

The lowest was 1988.

The genres with the highest mean profit:
1. Adventure
2. Fantasy
3. Animation

The lowest was History.

Vote averages with the highest mean revenue:
1. 7.9
2. 8.1
3. 9.1

Interestingly, a vote average of 10.0 is in 43rd place.

Word counts in titles with the highest mean profit:
1. 10
2. 8
3. 7

Introduction.

This report analyzes the Kaggle movie dataset, which provides variables such as the runtime of each movie, its revenue, and its budget. Our main objective is to understand why films fail or succeed commercially. We measure this with the profit variable, calculated as the difference between revenue and budget. To achieve this, we first apply data cleaning and preparation techniques and then carry out an extensive analysis. We also examine trends by genre and year in order to recognize patterns of success and understand which elements contribute to the commercial success of films.

Objectives.

Create the profit variable in order to determine the profit of each film.
Analyze genres and years in order to determine which were the most profitable.
Clean and prepare the data, handling missing values and errors, so that the data is in the right format for analysis.
Run an in-depth analysis to identify patterns, anomalies, and trends in the data.

Data Description

To answer the question of what makes a movie successful or unsuccessful, we first cleaned the data using functions like:
- filter
- as.factor/character/numeric etc.
- vis_miss
- group_by
- mutate

First, we read the CSV provided for the problem setup.

moviespersonal <- read.csv("C:\\Users\\Usuario\\Downloads\\Movies dataset\\movies_metadata.csv")

Calling Packages

We call the packages we plan to use to clean the data provided.

#install.packages("jsonlite")
library("dplyr", quietly = TRUE)
library("stringr", quietly = TRUE)

# We also load the packages we will use to create graphs later
#install.packages("esquisse")
#install.packages("ggplot2")
library("ggplot2", quietly = TRUE)
#install.packages("tidyverse")
library("tidyverse", quietly = TRUE)

Summary observations

We run a summary on the dataset to see if we can identify anything that we need to “clean”

summary(moviespersonal)

##     adult           belongs_to_collection    budget             genres         
##  Length:45466       Length:45466          Length:45466       Length:45466      
##  Class :character   Class :character      Class :character   Class :character  
##  Mode  :character   Mode  :character      Mode  :character   Mode  :character  
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##    homepage              id              imdb_id          original_language 
##  Length:45466       Length:45466       Length:45466       Length:45466      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  original_title       overview          popularity        poster_path       
##  Length:45466       Length:45466       Length:45466       Length:45466      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  production_companies production_countries release_date      
##  Length:45466         Length:45466         Length:45466      
##  Class :character     Class :character     Class :character  
##  Mode  :character     Mode  :character     Mode  :character  
##                                                              
##                                                              
##                                                              
##                                                              
##     revenue             runtime        spoken_languages      status         
##  Min.   :0.000e+00   Min.   :   0.00   Length:45466       Length:45466      
##  1st Qu.:0.000e+00   1st Qu.:  85.00   Class :character   Class :character  
##  Median :0.000e+00   Median :  95.00   Mode  :character   Mode  :character  
##  Mean   :1.121e+07   Mean   :  94.13                                        
##  3rd Qu.:0.000e+00   3rd Qu.: 107.00                                        
##  Max.   :2.788e+09   Max.   :1256.00                                        
##  NA's   :6           NA's   :263                                            
##    tagline             title              video            vote_average   
##  Length:45466       Length:45466       Length:45466       Min.   : 0.000  
##  Class :character   Class :character   Class :character   1st Qu.: 5.000  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.000  
##                                                           Mean   : 5.618  
##                                                           3rd Qu.: 6.800  
##                                                           Max.   :10.000  
##                                                           NA's   :6       
##    vote_count     
##  Min.   :    0.0  
##  1st Qu.:    3.0  
##  Median :   10.0  
##  Mean   :  109.9  
##  3rd Qu.:   34.0  
##  Max.   :14075.0  
##  NA's   :6

At first glance, we see several columns that are stored in the wrong data class.
adult, character -> factor (True/False)
budget, chr -> numeric/integer
id, chr -> integer
popularity, chr -> numeric
release_date, chr -> date
status, chr -> factor (Released, Rumored, Post Production, etc.)
revenue, numeric (already correct; converted below only as a safeguard)
runtime, numeric (already correct; converted below only as a safeguard)

We need to change these into their correct class to avoid any problem in the analysis.

Furthermore, we see some malformed rows that don’t allow for a correct analysis.
Rows 19731, 29504, and 35588 have errors in every single column. Since we have 45,466 rows of data, our findings will not be affected if we remove these 3 rows.

moviespersonal <- moviespersonal[-c(19731, 29504, 35588), ]

Changing Data Types

Once this is done, we can finally turn the columns into their correct data class.

moviespersonal$adult <- as.factor(moviespersonal$adult)
moviespersonal$budget <- as.numeric(moviespersonal$budget)
moviespersonal$id <- as.integer(moviespersonal$id)
moviespersonal$popularity <- as.numeric(moviespersonal$popularity)
moviespersonal$release_date <- as.Date(moviespersonal$release_date)
moviespersonal$revenue <- as.numeric(moviespersonal$revenue)
moviespersonal$runtime <- as.numeric(moviespersonal$runtime)
moviespersonal$status <- as.factor(moviespersonal$status)

Now that they are in their correct data type/class, we can begin to change the way the data is shown.
For example, the column “genres” contains not only the genres of the movie but also JSON-like punctuation and ids: “[{‘id’: 28, ‘name’: ‘Action’}, {‘id’: 12, ‘name’: ‘Adventure’},

The punctuation makes the data messy and hard to read.

We now create a new column that shows only the genres, to make it easy to read.

## [1] "'name': 'Animation', 'name': 'Comedy', 'name': 'Family'"               
## [2] "'name': 'Adventure', 'name': 'Fantasy', 'name': 'Family'"              
## [3] "'name': 'Romance', 'name': 'Comedy'"                                   
## [4] "'name': 'Comedy', 'name': 'Drama', 'name': 'Romance'"                  
## [5] "'name': 'Comedy'"                                                      
## [6] "'name': 'Action', 'name': 'Crime', 'name': 'Drama', 'name': 'Thriller'"

We can now see the genres in a cleaner format, but each one is still prefixed with ‘name’. To get rid of this unnecessary text, we apply a second cleaning step.

## [1] "Animation Comedy Family"     "Adventure Fantasy Family"   
## [3] "Romance Comedy"              "Comedy Drama Romance"       
## [5] "Comedy"                      "Action Crime Drama Thriller"

Now we can clearly read the genres of each movie in a new column called genre_names. They are still stored as chr, so we convert them to factor to properly analyze the data.

##     (Other)       Drama      Comedy Documentary 
##       13695        5000        3621        2723
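The code for these genre-cleaning steps is not shown in the output above. As a minimal sketch of one way to do it, the base R example below pulls every value that follows 'name': out of the JSON-like string. The helper name extract_genre_names, the sample strings, and the choice of separator are assumptions made for illustration, not the code that produced the output above.

```r
# Sample values in the same JSON-like format as the genres column (illustrative only)
raw_genres <- c(
  "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]",
  "[{'id': 878, 'name': 'Science Fiction'}, {'id': 28, 'name': 'Action'}]",
  "[]"
)

# Hypothetical helper: keep only the text that follows 'name': ' in each entry
extract_genre_names <- function(x, sep = " ") {
  # The look-behind matches the genre name without the surrounding punctuation
  found <- regmatches(x, gregexpr("(?<='name': ')[^']+", x, perl = TRUE))
  # Join the names of each movie into one string; no match gives ""
  vapply(found, paste, character(1), collapse = sep)
}

genre_names       <- extract_genre_names(raw_genres)
genre_names_piped <- extract_genre_names(raw_genres, sep = "|")
```

Joining with a separator other than a space, such as |, keeps two-word genres like Science Fiction intact if the names are split apart again later.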

Data Cleaning

Reading the values from the dataframe, we notice that revenue contains many values of 0, possibly because the revenue data could not be found or because of a different error. These zeros distort the statistics: they pull the mean down and affect the analysis of the movies.

To solve this, we replace the 0 values with the mean of the non-zero revenues.

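# Mean of the non-zero revenues; na.rm = TRUE keeps any NA from turning the mean into NA
mean_revenue <- mean(moviespersonal$revenue[moviespersonal$revenue > 0], na.rm = TRUE)

# Replace the zero values with the mean revenue
moviespersonal$revenue[moviespersonal$revenue == 0] <- mean_revenue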

Now we also replace NA values with the column’s mean

# Check if there are any NA values in the revenue column
any_na <- any(is.na(moviespersonal$revenue))

# If there are NA values, replace them with the mean revenue (excluding NA and zero values)
if(any_na) {
  # Recalculate the mean in case the zero replacement affected it, excluding NA values
  mean_revenue <- mean(moviespersonal$revenue[moviespersonal$revenue > 0], na.rm = TRUE)
  
  # Replace NA values with the new mean revenue
  moviespersonal$revenue[is.na(moviespersonal$revenue)] <- mean_revenue
}

We repeat the same process now on the column Budget.

# Calculate the budget's mean value excluding NA's and zero's
mean_budget <- mean(moviespersonal$budget[moviespersonal$budget > 0], na.rm = TRUE)

# Replace zero's with the budget's mean value
moviespersonal$budget[moviespersonal$budget == 0] <- mean_budget

# Replace NA's with the budget's mean value
moviespersonal$budget[is.na(moviespersonal$budget)] <- mean_budget
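The same zero-and-NA replacement is applied to both columns, so it could be wrapped in one helper. The sketch below is an illustration under that assumption; the function name impute_with_positive_mean and the toy vector are hypothetical.

```r
# Hypothetical helper: replace zeros and NAs with the mean of the positive values
impute_with_positive_mean <- function(x) {
  positive_mean <- mean(x[!is.na(x) & x > 0])
  x[is.na(x) | x == 0] <- positive_mean
  x
}

# Toy example (made-up numbers): the positive values are 100 and 300
revenue_example <- c(0, 100, NA, 300, 0)
imputed <- impute_with_positive_mean(revenue_example)
```

On the real data this could be called once per column, for example on moviespersonal$budget and moviespersonal$revenue.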

Data Explorer

For this next part, we are doing a DataExplorer report of the Movies database.

# We first install and call the DataExplorer library

#install.packages("DataExplorer")
library(DataExplorer)

# Now we can create a report using the library and the database

create_report(moviespersonal)

This function gives a general overview of the data: basic statistics, data structures, histograms, etc. The overview is displayed in HTML format in a browser tab.

As shown in the report, there are 43050 categories in the original title column. This is fewer than the number of rows, which shouldn’t be possible if there were no repeated movies in the database.

The report also shows that there are only 45089 categories in the IMDb id column, which may indicate that some movies are not in the Internet Movie Database (IMDb); these movies may lack information such as the average score (vote_average) and the number of reviews (vote_count).

There are also some movies lacking an overview (or synopsis). This column, however, may not be efficient to analyze in the long run, since it would mean analyzing each overview individually, and with more than 40 thousand entries this task would be extremely difficult.

Analyzing missing values

In our DataExplorer report we found that runtime is the only column that still has missing values that can actually be corrected. We will now analyze the data and try the mice package to fill in the missing values.

We apply the function to visualize missing data

# We use the visdat package
library(visdat)
summary(moviespersonal$runtime)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   85.00   95.00   93.78  107.00 1256.00     251

vis_miss(moviespersonal, warn_large_data = FALSE)

Now we see that release date is the only column with noticeable missing data, but it can’t be imputed since those values are dates.

Create a variable to identify missing values in ‘runtime’

moviespersonal <- moviespersonal %>%
  mutate(missing_inf = is.na(runtime))

Analyze if there’s any relationship with popularity

moviespersonal %>%
  group_by(missing_inf) %>%
  summarize(avg_popularity = mean(popularity))

## # A tibble: 2 × 2
##   missing_inf avg_popularity
##   <lgl>                <dbl>
## 1 FALSE                 2.93
## 2 TRUE                 NA

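The TRUE group returns NA because at least one of those movies has a missing popularity value and mean() was called without na.rm = TRUE, so it is difficult to come to a conclusion from this table.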
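To compare the two groups despite missing popularity values, the mean can be computed with na.rm = TRUE. The sketch below uses a small made-up data frame with the same column names; it illustrates the idea and is not output from the movies dataset.

```r
# Toy data with the same column names (values are made up)
toy <- data.frame(
  runtime    = c(90, NA, 120, NA, 100),
  popularity = c(2.5, 1.0, 4.0, NA, 3.0)
)
toy$missing_inf <- is.na(toy$runtime)

# Mean popularity per group, ignoring missing popularity values
avg_popularity <- tapply(toy$popularity, toy$missing_inf, mean, na.rm = TRUE)
```

In the dplyr pipeline used in this report, the equivalent change would be summarize(avg_popularity = mean(popularity, na.rm = TRUE)).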

Use MICE to impute missing values in ‘runtime’

#install.packages("mice")
library("mice", quietly = TRUE)

#method_list <- rep("", ncol(moviespersonal))

# Assign the names of the columns from the moviespersonal dataset to the method_list
#names(method_list) <- colnames(moviespersonal)

# Set the method for the 'runtime' column to "pmm" (predictive mean matching)
#method_list["runtime"] <- "pmm"

# Now run the mice function with the specified method_list
#accounts_mice <- mice(moviespersonal, m=1, maxit=50, meth="pmm", seed=500)

mice could not solve these missing values in our case, which we attribute to the sample of missing values being too small, so we now use the mean to fill in the missing values.

# Calculate the mean of the runtime column, excluding NA values
runtime_mean <- mean(moviespersonal$runtime, na.rm = TRUE)

# Replace NA values in the runtime column with the mean
moviespersonal <- moviespersonal %>%
  mutate(runtime = ifelse(is.na(runtime), runtime_mean, runtime))

moviespersonal %>%
  arrange(popularity) %>%
  vis_miss(warn_large_data = FALSE)


Interpretation of the Report

In this part of the project, we began by inspecting the information, cleaning it, and deleting null values and duplicates, since cleaning the data is the longest and most important part of ensuring that the information we analyze next is of good quality. We also identified the incorrect data types and the malformed rows and removed the anomalies. The genre data was specifically addressed by cleaning it from a JSON-like string format into a more readable form, which is crucial for any further analysis.

We also handled zero values in the revenue and budget columns by replacing them with the mean of the non-zero values, ensuring that no NA values were left untreated. This step is intended to prevent the analysis from being biased by missing or unreported financial data.

An important part of the process was ensuring data uniqueness by removing duplicate movie titles, which is crucial for accurate movie counts in the analysis.

In summary, the work follows a systematic approach to preparing the dataset for statistical analysis and possible model construction. However, the method we chose to address missing and zero values could be examined further to ensure it is the most appropriate approach, as it may have implications for the integrity of the analysis, especially if the zeros have real meaning within the dataset. After this initial data preparation, we are ready for exploratory data analysis to uncover insights and inform any hypotheses or models we plan to build.

Statistical Analysis

Tables

We analyze Adult movies by status

table1 <- table(moviespersonal$status, moviespersonal$adult)
prop.table(table1) * 100

##                  
##                          False         True
##                    0.182378020  0.000000000
##   Canceled         0.004737091  0.000000000
##   In Production    0.042633823  0.002368546
##   Planned          0.033159640  0.000000000
##   Post Production  0.215537660  0.000000000
##   Released        99.005210801  0.018948366
##   Rumored          0.495026054  0.000000000

We notice that released adult-rated films make up only about 0.02% of all films, far less than 1%.
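Because prop.table() without a margin divides by the grand total, each cell above is a share of all films. As a sketch of an alternative view, margin = 1 gives the share of adult films within each status. The toy vectors below are made up for illustration.

```r
# Toy data (made-up values) standing in for the status and adult columns
status <- c("Released", "Released", "Released", "Released", "Rumored", "Rumored")
adult  <- c("False", "False", "False", "True", "False", "False")

counts <- table(status, adult)

# Share of all films in each cell (the same kind of table as above)
overall_pct <- prop.table(counts) * 100

# Share within each status: every row sums to 100
by_status_pct <- prop.table(counts, margin = 1) * 100
```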

Bar plot using ggplot2

Now we see the difference graphically.

For this part of the setup we also use the Esquisse package, an interactive interface for building ggplot2 graphics, which makes it easier to create more detailed charts.

It is important to note that, rather than being used only through library calls, Esquisse is used by interacting with the Addins menu at the top of the screen.

Add a column with only the release year

moviespersonal$year <- substr(moviespersonal$release_date, 1, 4)

Scatter plot of average vote vs. revenue

Scatter plot of budget vs. revenue
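The plots themselves are not reproduced in this text. As a rough sketch of how such scatter plots could be built with ggplot2, assuming the column names vote_average, budget, and revenue from the dataset and using a small made-up data frame in place of moviespersonal:

```r
library(ggplot2)

# Made-up rows standing in for moviespersonal
toy_movies <- data.frame(
  vote_average = c(5.5, 6.8, 7.9, 8.1, 4.2),
  budget       = c(1e6, 2e7, 1.5e8, 9e7, 5e5),
  revenue      = c(3e6, 6e7, 7e8, 2e8, 1e5)
)

# Average vote vs. revenue
p_vote <- ggplot(toy_movies, aes(x = vote_average, y = revenue)) +
  geom_point(alpha = 0.6) +
  labs(title = "Average vote vs. revenue", x = "Vote average", y = "Revenue")

# Budget vs. revenue, with a linear trend line
p_budget <- ggplot(toy_movies, aes(x = budget, y = revenue)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(title = "Budget vs. revenue", x = "Budget", y = "Revenue")
```

Printing p_vote or p_budget draws the corresponding plot.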

Interpretation of the Report

In this section, we focused on generating tables and visualizations to glean insights from the dataset. Our analysis began with the creation of contingency tables to explore the relationship between movie status and adult content, as well as between video availability and adult content. These tables provide a comprehensive overview of the distribution of these variables within the dataset.

Furthermore, we leveraged the power of data visualization to gain deeper insights. Using the ggplot2 package, we crafted a bar plot to visually represent the distribution of movie status across different adult content categories. This graphical representation offers a clear depiction of how movie status varies with respect to adult content.

Additionally, we utilized scatter plots to examine relationships between key variables. Specifically, we explored the relationship between average vote and revenue, as well as between budget and revenue. These scatter plots enable us to visualize potential trends or patterns within the data, shedding light on any underlying relationships between these variables.

In summary, our analysis in this section underscores the importance of both tabular and graphical representations in uncovering insights from the dataset. Through careful examination of tables and visualizations, we gain a deeper understanding of the relationships and trends present in the data, laying the foundation for further analysis and interpretation.

Advanced Visualization

# With this particular line of code, we are calculating the profit of each movie
moviespersonal <- moviespersonal %>%
  mutate(profit= revenue - budget)

Vote Average/Genre vs. Revenue

With this code, we compute means by group: the mean revenue for each vote average, the score on the Internet Movie Database (IMDb). The intention is to show that a high online score does not necessarily mean a movie will perform well at the box office.

# This line of code helps us with stopping the numbers from appearing in scientific notation
options(scipen = 999)

# Now we code the mean revenue based on the average score on IMDb
mean_revenue <- moviespersonal %>%
  group_by(vote_average) %>%
  summarize(mean_revenue = mean(revenue, na.rm = TRUE)) %>%
  arrange(desc(mean_revenue))

# With this we print the result
print(mean_revenue)

## # A tibble: 92 × 2
##    vote_average mean_revenue
##           <dbl>        <dbl>
##  1          7.9    87140789.
##  2          8.1    87125416.
##  3          9.1    84393695.
##  4          7.6    82704986.
##  5          8.2    79617362.
##  6          8.3    78486765.
##  7          7.5    77187057.
##  8          7.4    75553881.
##  9          7.7    73748020.
## 10          6.7    72205531.
## # ℹ 82 more rows

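With this code we find the mean revenue per genre. Since a movie can have more than one genre and genre_names stores them together, we first separate the genres into one row each and then compute the mean revenue of the movies that include each genre. Note that because the genres are split on spaces, the two-word genres “Science Fiction” and “TV Movie” appear as two separate rows each (Science/Fiction and TV/Movie) with identical values.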

# Create a new column with separated genres into new rows, keeping the original 'genre_names' column unchanged
movies_with_separated_genres <- moviespersonal %>%
  mutate(separated_genres = genre_names) %>%
  separate_rows(separated_genres, sep = " ") %>%
  select(genre_names, separated_genres, revenue, profit)

# Get unique genres from the new 'separated_genres' column
unique_genres <- movies_with_separated_genres %>%
  select(separated_genres) %>%
  distinct() %>%
  pull(separated_genres)

# Initialize a dataframe to store the average revenue for each unique genre
mean_revenue_by_genre <- data.frame(genre = character(), mean_revenue = numeric(), stringsAsFactors = FALSE)

# Iterate over each unique genre to calculate the average revenue
for(genre in unique_genres) {
  mean_revenue <- movies_with_separated_genres %>%
    # Select rows where the new 'separated_genres' column contains the current genre
    filter(str_detect(separated_genres, fixed(genre))) %>%
    # Calculate the average revenue for the selected genre
    summarize(mean_revenue = mean(revenue, na.rm = TRUE)) %>%
    # Extract the average revenue value
    pull(mean_revenue)
  
  # Append the genre and its average revenue to the dataframe
  mean_revenue_by_genre <- rbind(mean_revenue_by_genre, data.frame(genre = genre, mean_revenue = mean_revenue))
}

# Display the result
print(mean_revenue_by_genre)

##          genre mean_revenue
## 1    Animation     89334110
## 2       Comedy     67805376
## 3       Family     89513534
## 4    Adventure    103618594
## 5      Fantasy     94402872
## 6      Romance     64738395
## 7        Drama     63925945
## 8       Action     81127887
## 9        Crime     66435133
## 10    Thriller     68885257
## 11      Horror     64464103
## 12     History     64717137
## 13     Science     83469618
## 14     Fiction     83469618
## 15     Mystery     67625498
## 16         War     68262127
## 17     Foreign     65258394
## 18                      NaN
## 19       Music     65440574
## 20 Documentary     65198337
## 21     Western     65258973
## 22          TV     68747587
## 23       Movie     68747587
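The loop above can also be expressed as a single grouped computation. The following sketch is one possible base R version, under the assumption that the genre names are joined with | instead of a space so that two-word genres stay together; the toy data and the separator are illustrative choices, not the method used for the output above.

```r
# Toy data (made-up revenues); genres joined with "|" so "Science Fiction" stays whole
toy <- data.frame(
  genre_names = c("Action|Science Fiction", "Comedy", "Action|Comedy"),
  revenue     = c(300, 100, 200),
  stringsAsFactors = FALSE
)

# One row per (movie, genre) pair
genre_list <- strsplit(toy$genre_names, "|", fixed = TRUE)
long <- data.frame(
  genre   = unlist(genre_list),
  revenue = rep(toy$revenue, lengths(genre_list))
)

# Mean revenue per genre, highest first
mean_revenue_by_genre <- aggregate(revenue ~ genre, data = long, FUN = mean)
mean_revenue_by_genre <- mean_revenue_by_genre[order(-mean_revenue_by_genre$revenue), ]
```

Movies with an empty genre string contribute no rows here, so no empty genre appears in the result.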

Statistical observations

Using e1071 package

library("e1071")
mean(moviespersonal$profit)

## [1] 47103572

var(moviespersonal$profit)

## [1] 2482584026353584

sd(moviespersonal$profit)

## [1] 49825536

median(moviespersonal$profit)

## [1] 47183112

quantile(moviespersonal$profit)

##         0%        25%        50%        75%       100% 
## -111007242   47183112   47183112   47183112 2550965087

min(moviespersonal$profit)

## [1] -111007242

max(moviespersonal$profit)

## [1] 2550965087

range(moviespersonal$profit)

## [1] -111007242 2550965087

skewness(moviespersonal$profit)

## [1] 13.28206

We now know that the distribution is strongly skewed to the right. With this data, we can compute the interval of one standard deviation around the mean and, later, the correlation coefficient.

profit_mean <-mean(moviespersonal$profit)
profit_sd <-sd(moviespersonal$profit) 
profit_dev <- c(profit_mean - profit_sd, profit_mean + profit_sd)
profit_dev

## [1] -2721964 96929108

The mean profit ± one standard deviation is saved in the object “profit_dev”.
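An interval of the mean plus or minus one standard deviation describes the spread of individual movies. A confidence interval for the mean instead uses the standard error, sd / sqrt(n). A minimal sketch with made-up numbers, assuming a t-based 95% interval is what is wanted:

```r
# Made-up profit values for illustration
profit_example <- c(10, 12, 9, 11, 50, 8, 13, 10)

n          <- length(profit_example)
profit_avg <- mean(profit_example)
std_error  <- sd(profit_example) / sqrt(n)

# t-based 95% confidence interval for the mean
t_crit <- qt(0.975, df = n - 1)
profit_ci <- c(lower = profit_avg - t_crit * std_error,
               upper = profit_avg + t_crit * std_error)
```

With a distribution as skewed as profit, a t-based interval is only an approximation.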

#Coefficient of variation
CV <- sd(moviespersonal$profit)/mean(moviespersonal$profit)
CV

## [1] 1.057787

#Trimmed mean
mean(moviespersonal$profit, trim=.1)

## [1] 46343198

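#Z scores
#"scale" automatically gives you the zscores; as.numeric() drops the matrix attributes it returns
moviespersonal <- moviespersonal %>%
  mutate(zscore = as.numeric(scale(profit)))
moviespersonal %>% 
  filter(zscore < 3)
# This displays the rows with a zscore below 3 (438 rows lie above that threshold).
# The result is not assigned, so moviespersonal itself still contains every row.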
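To actually drop the extreme rows, the filtered result has to be assigned to an object. A small sketch with made-up values, assuming the goal is to keep rows within three standard deviations on either side of the mean (the name movies_trimmed is hypothetical):

```r
# Made-up profits: 29 ordinary values and one extreme outlier
toy <- data.frame(profit = c(rep(c(9, 10, 11), length.out = 29), 1000))

# scale() returns a one-column matrix, so convert it to a plain numeric vector
toy$zscore <- as.numeric(scale(toy$profit))

# Keep rows within 3 standard deviations and store the result
movies_trimmed <- toy[abs(toy$zscore) < 3, ]
removed_rows   <- nrow(toy) - nrow(movies_trimmed)
```

Using abs() also catches extreme losses, which a one-sided zscore < 3 filter would keep.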

Analyzing statistical observations

Now, we will add a new column called words_title, which tells us how many words are in the title. We then group by this column and compare the profit for each word count.

moviespersonal$words_title <- str_count(moviespersonal$original_title, "\\w+")
moviespersonal %>%
  group_by(words_title) %>%
  summarize(mean_profit = mean(profit),
            sd_profit = sd(profit),
            median_profit = median(profit),
            quantile_profit = quantile(profit, 0.90),
            count = n()) %>%
  arrange(desc(mean_profit))

## # A tibble: 20 × 6
##    words_title mean_profit  sd_profit median_profit quantile_profit count
##          <int>       <dbl>      <dbl>         <dbl>           <dbl> <int>
##  1          10   64071620. 117714391.     47183112.       49509867.   127
##  2           8   57643350. 102907506.     47183112.       47183112.   510
##  3           7   54891330.  78345824.     47183112.       52947905    896
##  4          11   54448920.  62437300.     47183112.       60466534.    73
##  5          16   53851205.  11549475.     47183112.       63186534.     3
##  6          12   51830574.  33066573.     47183112.       47183112.    41
##  7           9   50407416.  49948258.     47183112.       47183112.   245
##  8          14   48806518.   5853273.     47183112.       47183112.    13
##  9           5   48545257.  56870932.     47183112.       47183112.  3575
## 10           6   48448381.  51368166.     47183112.       47183112.  1792
## 11           4   47456572.  44430688.     47183112.       47183112.  6254
## 12          15   47183112.         0      47183112.       47183112.     9
## 13          17   47183112.         0      47183112.       47183112.     2
## 14          18   47183112.         0      47183112.       47183112.     2
## 15          19   47183112.         0      47183112.       47183112.     3
## 16           3   46681369.  42229807.     47183112.       48787390.  9280
## 17           1   45982691.  54139116.     47183112.       60787390.  8056
## 18           2   45970169.  44364092.     47183112.       60927390. 11314
## 19          13   44236877.  10057832.     47183112.       47183112.    24
## 20          20   24350000         NA      24350000        24350000      1

Now we will group by year to see which years were the most profitable for cinema

moviespersonal %>%
  group_by(year) %>%
  summarize(mean_profit = mean(profit),
            sd_profit = sd(profit),
            median_profit = median(profit),
            quantile_profit = quantile(profit, 0.90),
            count = n()) %>%
  arrange(desc(mean_profit))

## # A tibble: 136 × 6
##    year  mean_profit sd_profit median_profit quantile_profit count
##    <chr>       <dbl>     <dbl>         <dbl>           <dbl> <int>
##  1 1902    68781405.       NA      68781405.       68781405.     1
##  2 2017    56221083. 81988788.     47183112.       64627390.   469
##  3 1905    51503961.  9661710.     47183112.       60145657.     5
##  4 2016    51400780. 69440583.     47183112.       63787390.  1442
##  5 2015    50584676. 81454255.     47183112.       58487390.  1744
##  6 1977    50302191. 46357949.     47183112.       47183112.   312
##  7 1904    50268366.  8162815.     47183112.       55821823.     7
##  8 2018    49824823.  4193441.     47183112.       54387390.     5
##  9 2012    49425227. 60546406.     47183112.       65787390.  1589
## 10 2011    49410075. 59204531.     47183112.       66987390.  1535
## # ℹ 126 more rows

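The year with the highest mean profit was 1902, then 2017, then 1905. Note that 1902 and 1905 contain only 1 and 5 movies respectively, so their means rest on very few observations.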

ggplot(moviespersonal, aes(x = year, y = profit)) + 
  geom_density() + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ggtitle("Profit by Year")

Analyzing correlation between profit and other variables.

We will continue using the word count vs profit to find the correlation between them (if any)

## `geom_smooth()` using formula = 'y ~ x'

The linear model applied to the data suggests a relatively flat slope, indicating that there is minimal change in profit as the number of words in the title increases.

Now, we compute the Pearson coefficient of correlation

moviespersonal %>% 
  summarize(N = n(), r = cor(profit, words_title))

##       N          r
## 1 42220 0.03440706

The correlation between the word count in the title and the movie’s profit is extremely low / null, 0.03.

We will do this with the budget and the popularity to see if there are any positive correlations.

## `geom_smooth()` using formula = 'y ~ x'

##       N         r
## 1 42220 0.4841505

Although still not a very strong correlation, budget and profit have a moderate correlation of 0.48.

Now using Profit and Revenue

## `geom_smooth()` using formula = 'y ~ x'

moviespersonal %>% 
  summarize(N = n(), r = cor(revenue, profit))

##       N         r
## 1 42220 0.9745676

As expected, we see a very strong correlation (r = 0.97), since profit is computed directly from revenue.

Now using Profit and Popularity

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

##       N  r
## 1 42220 NA

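The correlation coefficient for popularity is NA because popularity contains a missing value (the warnings above report 1 row removed) and cor() returns NA by default when any value is missing. We next try another variable, runtime.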
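A coefficient can still be obtained when a column has missing values by telling cor() to use only the complete pairs. A minimal sketch with made-up values:

```r
# Made-up values; popularity has one missing entry
profit     <- c(10, 20, 30, 40, 50)
popularity <- c(1.0, 2.5, NA, 3.5, 5.0)

# Default behaviour: any NA makes the result NA
r_default <- cor(profit, popularity)

# Use only the rows where both values are present
r_complete <- cor(profit, popularity, use = "complete.obs")
```

Inside the summarize() call used in this report, the same argument could be passed as cor(profit, popularity, use = "complete.obs").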

## `geom_smooth()` using formula = 'y ~ x'

##       N          r
## 1 42067 0.04163259

To get rid of movies that are longer than about 4 hours in runtime (250 minutes or more), we used the filter function, which leaves 42067 movies. The coefficient is 0.04, so although the fitted line has a positive slope, the correlation between runtime and profit is extremely low.

We can see that the revenues per genre are widely spread, which is due to the quantity of data. We can also notice that Foreign, Movie and TV have really low revenue, while Action, Adventure, Fantasy, Fiction and Science have extreme outliers, since the movie Avatar belongs to these genres and has the highest revenue in the dataset.

Now we analyze the new column that we created, named profit.

## # A tibble: 23 × 6
##    separated_genres mean_profit  sd_profit median_profit quantile_90 count
##    <chr>                  <dbl>      <dbl>         <dbl>       <dbl> <int>
##  1 "Adventure"        71428203. 134477234.     47183112.   99600000   3231
##  2 "Fantasy"          65023884. 124170024.     47183112.   68739390.  2119
##  3 "Animation"        62345097.  95348486.     47183112.   67987390.  1846
##  4 "Family"           61621163.  97836888.     47183112.   67507390.  2590
##  5 "Fiction"          57536350. 104793345.     47183112.   68193990.  2829
##  6 "Science"          57536350. 104793345.     47183112.   68193990.  2829
##  7 "Action"           54819226.  91432796.     47183112.   66787390.  6119
##  8 "Movie"            47984308.   3848747.     47183112.   47183112.   673
##  9 "TV"               47984308.   3848747.     47183112.   47183112.   673
## 10 ""                 47093271.   6241410.     47183112.   47183112.  2280
## # ℹ 13 more rows

These results suggest that genres like “Adventure,” “Fantasy,” and “Animation” have the potential for very high profits. This may partly be driven by Avatar, which belongs to several of these genres; we will visualize this in the next part. There is also significant variability within each genre, as the large standard deviations show. Genres like “Drama” and “Comedy” are prolific but tend to have lower average profits. Note that “Fiction” and “Science,” and likewise “Movie” and “TV,” have identical rows, which suggests that the two-word genres “Science Fiction” and “TV Movie” were split into separate words. This analysis can inform movie production decisions, particularly if the goal is to maximize profitability by focusing on genres with higher earning potential.
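The grouped statistics in the table above can be sketched in base R as follows. The data frame toy and its values are invented for illustration (they are not the report's data), and the report's own code for this table is not shown, so this is only one possible way to compute it. The example also shows how a single extreme value pulls the mean far above the median, which is the effect attributed to Avatar.

```r
# Toy data frame (hypothetical values, not the report's data)
toy <- data.frame(
  genre  = c("Adventure", "Adventure", "Adventure", "Drama", "Drama", "Drama"),
  profit = c(5, 8, 500, 4, 6, 9)   # 500 plays the role of a single blockbuster
)

# Statistics matching the columns of the table above
summarise_genre <- function(p) {
  c(mean   = mean(p),
    sd     = sd(p),
    median = median(p),
    q90    = unname(quantile(p, 0.90)),
    count  = length(p))
}

# Split profit by genre, summarise each group, and put one genre per row
by_genre <- t(sapply(split(toy$profit, toy$genre), summarise_genre))
print(by_genre)
```

In this toy example the Adventure mean is 171 while its median is 8, so comparing the two columns is a quick check for outlier-driven averages.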

Skewness and Log Transformation Analysis

## Skewness of revenue: 12.78445

## Skewness of budget: 7.167064
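Skewness measures how asymmetric a distribution is: values well above 0 indicate a long right tail. The report does not show which function produced the numbers above, so the sketch below assumes the plain moment-based definition, m3 / m2^(3/2), and applies it to simulated log-normal data (budget_sim is illustrative, not the report's budget column). It also shows the effect of a natural-log transform on the same data.

```r
# Moment-based sample skewness: m3 / m2^(3/2),
# where m_k is the k-th central moment of the sample
skewness_moment <- function(x) {
  x  <- x[is.finite(x)]
  d  <- x - mean(x)
  m2 <- mean(d^2)
  m3 <- mean(d^3)
  m3 / m2^(3 / 2)
}

set.seed(1)
# Simulated right-skewed "budgets" (log-normal draws; purely illustrative)
budget_sim <- rlnorm(5000, meanlog = 16, sdlog = 1.5)

skew_raw <- skewness_moment(budget_sim)
skew_log <- skewness_moment(log(budget_sim))

cat("Skewness of simulated budget:", skew_raw, "\n")
cat("Skewness after log transform:", skew_log, "\n")
```

Because the logarithm of a log-normal variable is normal, the second value should be close to 0. Real budget data will not be exactly log-normal, and columns containing zeros would need log1p or a filter before transforming.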

Distribution of original and transformed Budget

Original Budget

[Figure: Density of Original Budget]

The original budget distribution is highly right-skewed, indicating that most movies have a low budget, while a few movies have a significantly higher budget. The long tail to the right shows that there are outliers with exceptionally large budgets compared to the rest.

Transformed Budget

[Figure: Density of Log-Transformed Budget]

The log transformation has normalized the distribution of budget values, as evidenced by the peak around 15 to 17 on the log_budget scale. This suggests that a majority of movies have a budget within a mid-range when the scale is logarithmic, reducing the influence of extreme values.

Distribution of original and transformed Revenue

Original Revenue

[Figure: Density of Original Revenue]

The density plot for the original revenue values demonstrates extreme right skewness, where the majority of movies earn revenue in the lower range, while very few have extraordinarily high revenues, creating a long right tail. This skewness could be due to blockbuster hits which are rare but generate a massive amount of revenue.

Transformed Revenue

[Figure: Density of Log-Transformed Revenue]

After the logarithmic transformation, the revenue data shows a more bell-shaped distribution, though there is still a slight right skew. This indicates a more normalized distribution, but with a few outliers with exceptionally high revenue, visible from the tail extending to the right.

Summarized data

We now show the summarized, log-transformed data to see the central tendency and variability of Revenue and Budget.

##   mean_log_revenue median_log_revenue iqr_log_revenue mean_log_budget
## 1         17.70161           18.04653               0        16.58778
##   median_log_budget iqr_log_budget
## 1           16.8884              0
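An IQR of 0 means that the first and third quartiles coincide, which happens when roughly the middle half of the sorted values are identical, for example after many missing entries were filled with one constant. Whether that is the cause here is an assumption; the sketch below only demonstrates the effect with invented numbers (log_revenue_toy is not the report's data).

```r
# Invented log-revenue values: 3 distinct observations plus 7 copies of one constant
log_revenue_toy <- c(12.1, 15.3, 19.8, rep(17.7, 7))

# Same three statistics as in the summary above
summary_toy <- c(
  mean   = mean(log_revenue_toy),
  median = median(log_revenue_toy),
  iqr    = IQR(log_revenue_toy)
)
print(summary_toy)
```

Here both quartiles fall on the repeated value, so the IQR is 0 even though the data still contain smaller and larger observations. When this happens, the standard deviation or a wider quantile range is a more informative measure of spread.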

Now we will use rnorm to visualize a simulated histogram for revenue

# The following code examines the distribution of movie revenues through simulation.
# We draw simulated revenues, plot them as a histogram, and calculate the probability that a movie generates less than a given amount of revenue, assuming that revenues are normally distributed.

mean_revenue <- mean(moviespersonal$revenue, na.rm = TRUE)
sd_revenue <- sd(moviespersonal$revenue, na.rm = TRUE)

n10000 <- rnorm(10000, mean = mean_revenue, sd = sd_revenue)

# Create a histogram
hist(n10000, breaks = 100, main = "Histogram of Simulated Revenues", 
     xlab = "Revenue", ylab = "Frequency", col = "lightblue")

value_to_evaluate <- 5
probability <- pnorm(value_to_evaluate, mean = mean_revenue, sd = sd_revenue, lower.tail = TRUE)

# Display the probability result
cat("The probability that the revenue is less than", value_to_evaluate, "is:", probability, "\n")

## The probability that the revenue is less than 5 is: 0.1210488

The code first computes the mean and standard deviation of revenue, that is, how much films earn on average and how much earnings vary from film to film. It then draws 10,000 simulated revenues from a normal distribution with that mean and standard deviation and plots them as a histogram. Finally, it uses pnorm to calculate the probability that a film earns less than a threshold under this normal model. Note that the threshold in the code is 5, expressed in the units of the revenue column. Revenue appears to be stored in dollars (a mean log revenue of about 17.7 corresponds to tens of millions), so the reported 0.121 is the probability of revenue below 5 dollars; evaluating 5 million dollars would require a threshold of 5e6.

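Overall, under the normal approximation the simulated revenues cluster around the mean, with extreme outcomes equally likely on either side. This symmetry is a property of the assumed normal distribution rather than of the data: the actual revenue distribution is strongly right-skewed (skewness 12.78), and the model assigns about 12% probability to revenue below 5, which is implausible for real movies. The probabilities should therefore be read as rough approximations.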
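To see how much a normal model can differ from skewed data, the sketch below compares the pnorm probability with the empirical proportion below a threshold. It uses simulated log-normal revenues (revenue_sim, its parameters, and the 5 million threshold are assumptions for illustration, not values from the report).

```r
set.seed(42)
# Simulated right-skewed revenues in dollars (log-normal; illustrative only)
revenue_sim <- rlnorm(10000, meanlog = 16, sdlog = 2)

threshold <- 5e6   # 5 million, in the same units as revenue_sim

# Probability under a normal model with the sample mean and sd
p_normal <- pnorm(threshold, mean = mean(revenue_sim), sd = sd(revenue_sim))

# Empirical proportion of simulated revenues below the threshold
p_empirical <- mean(revenue_sim < threshold)

# Probability the normal model assigns to impossible negative revenues
p_negative <- pnorm(0, mean = mean(revenue_sim), sd = sd(revenue_sim))

cat("Normal model:  ", p_normal, "\n")
cat("Empirical:     ", p_empirical, "\n")
cat("P(revenue < 0):", p_negative, "\n")
```

With real data, replacing revenue_sim by the observed revenue column gives the same comparison. A large p_negative is a sign that the normal assumption is a poor fit, and that working on the log scale or using the empirical proportion directly may be preferable.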

Evidence

Dereck Iker VIllafaña Romero A00573756

2024-05-03

Executive summary

Introduction.

Objectives.

Data Description

Calling Packages

Summary observations

Changing Data Types

Data Cleaning

Data Explorer

Analyzing missing values

We apply the function to visualize missing data

Create a variable to identify missing values in ‘runtime’

Analyze if there’s any relationship with popularity

Use MICE to impute missing values in ‘runtime’

Calculate the mean of ‘runtime’ to impute missing values

Replace missing values in ‘runtime’ with the mean

Visualize again the distribution of missing values of ‘runtime’ sorted by popularity

Interpretation of the Report

Statistical Analysis

Tables

Bar plot using ggplot2

Add a column with only the release year

Scatter plot of average vote vs. revenue

Scatter plot of budget vs. revenue

Interpretation of the Report

Advanced Visualization

Vote Average/Genre vs. Revenue

Statistical observations

Analyzing statistical observations

Analyzing correlation between profit and other variables.

Skewness and Log Transformation Analysis

Distribution of original and transformed Budget

Original Budget

Transformed Budget

Distribution of original and transformed Revenue

Original Revenue

Transformed Revenue

Summarized data