Executive summary

The movies dataset from Kaggle provides a long list of movies from many different countries. Besides the movie titles, the dataset includes run time, original language, status, tagline, etc., and our two most important variables, revenue and budget. We consider these critical because they let us create (mutate) a new column called profit. With profit defined, we can begin to analyze our problem setup question:

What makes a movie successful or unsuccessful?

We analyzed the data using statistical tools such as Pearson’s correlation coefficient (r) to compare profit against several variables.
1. Words in the title - Profit, r = 0.03, “Extremely low/null”
2. Budget - Profit, r = 0.48, “Moderate”
3. Revenue - Profit, r = 0.97, “Very strong” (expected, since profit is computed as revenue minus budget)
4. Popularity - Profit, r = NA (not computed because of a missing value)
5. Run time (<250 minutes) - Profit, r = NA (not computed)

The years with the highest mean profit
1. 1902
2. 2017
3. 1905,
while the lowest was 1988.

The genres with the most profit
1. Adventure
2. Fantasy
3. Animation,
while the lowest was History.

Vote Averages with the most revenue
1. 7.9
2. 8.1
3. 9.1
Interestingly, a vote average of 10.0 is in 43rd place.

Word count in titles with the most profit
1. 10
2. 8
3. 7

Progress setup 1 - Data Wrangling/Cleaning

To answer the question of what makes a movie successful or unsuccessful, we first cleaned the data using functions like:
- filter
- as.factor/character/numeric etc.
- vis_miss
- group_by
- mutate

First, we read the CSV provided for the problem setup.

moviespersonal <- read.csv("/Users/carosuarez/Downloads/movies_metadata.csv")

Calling Packages

We call the packages we plan to use to clean the data provided.

#install.packages("jsonlite")
library("dplyr", quietly = TRUE)
library("stringr", quietly = TRUE)

# We also install the packages we will use to create graphs later
#install.packages("esquisse")
#install.packages("ggplot2")
library("ggplot2", quietly = TRUE)
#install.packages("tidyverse")
library("tidyverse", quietly = TRUE)

Summary observations

We run a summary on the dataset to see if we can identify anything that we need to “clean”

summary(moviespersonal)
##     adult           belongs_to_collection    budget             genres         
##  Length:45466       Length:45466          Length:45466       Length:45466      
##  Class :character   Class :character      Class :character   Class :character  
##  Mode  :character   Mode  :character      Mode  :character   Mode  :character  
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##    homepage              id              imdb_id          original_language 
##  Length:45466       Length:45466       Length:45466       Length:45466      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  original_title       overview          popularity        poster_path       
##  Length:45466       Length:45466       Length:45466       Length:45466      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  production_companies production_countries release_date      
##  Length:45466         Length:45466         Length:45466      
##  Class :character     Class :character     Class :character  
##  Mode  :character     Mode  :character     Mode  :character  
##                                                              
##                                                              
##                                                              
##                                                              
##     revenue             runtime        spoken_languages      status         
##  Min.   :0.000e+00   Min.   :   0.00   Length:45466       Length:45466      
##  1st Qu.:0.000e+00   1st Qu.:  85.00   Class :character   Class :character  
##  Median :0.000e+00   Median :  95.00   Mode  :character   Mode  :character  
##  Mean   :1.121e+07   Mean   :  94.13                                        
##  3rd Qu.:0.000e+00   3rd Qu.: 107.00                                        
##  Max.   :2.788e+09   Max.   :1256.00                                        
##  NA's   :6           NA's   :263                                            
##    tagline             title              video            vote_average   
##  Length:45466       Length:45466       Length:45466       Min.   : 0.000  
##  Class :character   Class :character   Class :character   1st Qu.: 5.000  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.000  
##                                                           Mean   : 5.618  
##                                                           3rd Qu.: 6.800  
##                                                           Max.   :10.000  
##                                                           NA's   :6       
##    vote_count     
##  Min.   :    0.0  
##  1st Qu.:    3.0  
##  Median :   10.0  
##  Mean   :  109.9  
##  3rd Qu.:   34.0  
##  Max.   :14075.0  
##  NA's   :6

At first glance, we see several columns that are in the wrong data class.
adult, chr -> factor (True/False)
budget, chr -> numeric
id, chr -> integer
popularity, chr -> numeric
release_date, chr -> date
status, chr -> factor (Released, Rumored, etc.)
revenue and runtime are already numeric in the summary above, so converting them again below is harmless.

We need to change these into their correct class to avoid any problem in the analysis.

Furthermore, we see some malformed rows in the data that don’t allow for a correct analysis.
Rows 19731, 29504, and 35588 have errors in every single column. Since we have 45,466 rows of data, our findings will not be affected if we remove these 3 rows.

moviespersonal <- moviespersonal[-c(19731, 29504, 35588), ]
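
The report does not show how the three row numbers were found. One possible way to locate such rows, sketched here on hypothetical toy data (the `toy` data frame and its values are assumptions, not part of the dataset), is to look for values that fail numeric conversion:

```r
# Hypothetical toy data: 'budget' arrives as character, and one row holds
# text that was shifted in from another column.
toy <- data.frame(
  title  = c("A", "B", "C", "D"),
  budget = c("30000000", "0", "/poster.jpg", "65000000"),
  stringsAsFactors = FALSE
)

# as.numeric() returns NA (with a warning) for strings that are not numbers,
# so rows that were not NA before but are NA after conversion are malformed.
converted <- suppressWarnings(as.numeric(toy$budget))
bad_rows  <- which(is.na(converted) & !is.na(toy$budget))
bad_rows

# Drop only when something was found: x[-integer(0), ] would drop every row.
if (length(bad_rows) > 0) {
  toy <- toy[-bad_rows, ]
}
nrow(toy)
```

Checking a column that should be purely numeric is only a heuristic; it may miss rows that are broken in other columns.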

Changing Data Types

Once this is done, we can finally turn the columns into their correct data class.

moviespersonal$adult <- as.factor(moviespersonal$adult)
moviespersonal$budget <- as.numeric(moviespersonal$budget)
moviespersonal$id <- as.integer(moviespersonal$id)
moviespersonal$popularity <- as.numeric(moviespersonal$popularity)
moviespersonal$release_date <- as.Date(moviespersonal$release_date)
moviespersonal$revenue <- as.numeric(moviespersonal$revenue)
moviespersonal$runtime <- as.numeric(moviespersonal$runtime)
moviespersonal$status <- as.factor(moviespersonal$status)

Now that they are in their correct data type/class, we can begin to change the way the data is shown.
For example, the column “genres” contains not only the genres of the movie but also a lot of JSON-like punctuation, for example: “[{‘id’: 28, ‘name’: ‘Action’}, {‘id’: 12, ‘name’: ‘Adventure’}, ...”

The punctuation makes the data messy and hard to read.

We now create a new column that shows only the genre names.

pattern <- "'name':\\s*'([^']+)'"

# Use sapply with str_extract_all to extract all genre names and collapse them into a single string
moviespersonal$genre_names <- sapply(str_extract_all(moviespersonal$genres, pattern), function(x) paste(x, collapse = ", "))

head(moviespersonal$genre_names)
## [1] "'name': 'Animation', 'name': 'Comedy', 'name': 'Family'"               
## [2] "'name': 'Adventure', 'name': 'Fantasy', 'name': 'Family'"              
## [3] "'name': 'Romance', 'name': 'Comedy'"                                   
## [4] "'name': 'Comedy', 'name': 'Drama', 'name': 'Romance'"                  
## [5] "'name': 'Comedy'"                                                      
## [6] "'name': 'Action', 'name': 'Crime', 'name': 'Drama', 'name': 'Thriller'"

We can now see the genres in a cleaner format, but each entry still carries the 'name': prefix. To get rid of this unnecessary text, we run the following code.

moviespersonal$genre_names <- gsub("'name':\\s'", "", moviespersonal$genre_names)

# Remove the ending single quotes and any commas directly following it
moviespersonal$genre_names <- gsub("',\\s", " ", moviespersonal$genre_names)

# You may also want to remove the single quotes at the very end of each string if they exist
moviespersonal$genre_names <- gsub("'$", "", moviespersonal$genre_names)

# Check the result
head(moviespersonal$genre_names)
## [1] "Animation Comedy Family"     "Adventure Fantasy Family"   
## [3] "Romance Comedy"              "Comedy Drama Romance"       
## [5] "Comedy"                      "Action Crime Drama Thriller"

Now we can clearly read the genres of each movie in a new column called genre_names. However, they are still stored as chr, so we convert them to factor to analyze the data properly.

##     (Other)       Drama      Comedy Documentary 
##       13695        5000        3621        2723
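
As an alternative to the extract-then-gsub steps above, the capture group in the pattern can be used directly. The sketch below uses base R on hypothetical sample strings (`genres_raw` is an assumption) and joins the names with "|" instead of a space, so that two-word genres such as "Science Fiction" are not broken apart when the column is split later:

```r
# Hypothetical sample of the raw 'genres' strings.
genres_raw <- c(
  "[{'id': 878, 'name': 'Science Fiction'}, {'id': 28, 'name': 'Action'}]",
  "[{'id': 10770, 'name': 'TV Movie'}]",
  "[]"
)

pattern <- "'name':\\s*'([^']+)'"

# gregexpr() locates every match and regmatches() returns the matched text;
# sub() with "\\1" then keeps only the capture group (the genre name).
matches <- regmatches(genres_raw, gregexpr(pattern, genres_raw))
genre_names <- vapply(
  matches,
  function(m) paste(sub(pattern, "\\1", m), collapse = "|"),
  character(1)
)
genre_names

# Splitting on "|" keeps two-word genres in one piece.
strsplit(genre_names, "|", fixed = TRUE)
```

The choice of "|" as the delimiter is arbitrary; any character that never appears inside a genre name would work.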

Problem Setup 2 - NA’s

As we read the values from the dataframe, we notice that revenue contains a lot of zeros (the median and third quartile in the summary above are both 0). These are most likely recorded as 0 because the revenue could not be found, or because of some other error. Left as they are, the zeros would pull the mean down and distort the analysis of these movies.

To solve this, we replace the zeros with the mean of the non-zero revenues.

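# Mean of the reported (non-zero) revenues; na.rm = TRUE skips the NA rows
mean_revenue <- mean(moviespersonal$revenue[moviespersonal$revenue > 0], na.rm = TRUE)

# Replace the zero values with the mean revenue (which() ignores NA comparisons)
moviespersonal$revenue[which(moviespersonal$revenue == 0)] <- mean_revenue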

Now we also replace NA values with the column’s mean

# Check if there are any NA values in the revenue column
any_na <- any(is.na(moviespersonal$revenue))

# If there are NA values, replace them with the mean revenue (excluding NA and zero values)
if(any_na) {
  # Recalculate the mean in case the zero replacement affected it, excluding NA values
  mean_revenue <- mean(moviespersonal$revenue[moviespersonal$revenue > 0], na.rm = TRUE)
  
  # Replace NA values with the new mean revenue
  moviespersonal$revenue[is.na(moviespersonal$revenue)] <- mean_revenue
}

We repeat the same process now on the column Budget.

# Calculate the budget's mean value excluding NA's and zero's
mean_budget <- mean(moviespersonal$budget[moviespersonal$budget > 0], na.rm = TRUE)

# Replace zero's with the budget's mean value
moviespersonal$budget[moviespersonal$budget == 0] <- mean_budget

# Replace NA's with the budget's mean value
moviespersonal$budget[is.na(moviespersonal$budget)] <- mean_budget
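
Mean imputation keeps the column mean unchanged but compresses the spread. The sketch below, on a hypothetical `revenue` vector, compares it with the alternative of recoding zeros as NA and skipping them:

```r
# Hypothetical revenues; 0 stands for "not reported".
revenue <- c(0, 0, 0, 0, 0, 0, 10, 20, 30, 40)

# Option 1: recode zeros as NA and skip them in each statistic.
revenue_na <- replace(revenue, revenue == 0, NA)

# Option 2: mean imputation, as done above.
revenue_imputed <- replace(revenue, revenue == 0, mean(revenue[revenue > 0]))

# The mean is the same either way ...
mean(revenue_na, na.rm = TRUE)
mean(revenue_imputed)

# ... but imputation shrinks the spread, because six of the ten values are
# now identical.
sd(revenue_na, na.rm = TRUE)
sd(revenue_imputed)
IQR(revenue_imputed)
```

When more than half of a column is imputed, as in this toy vector, the quartiles collapse onto the imputed value and the interquartile range becomes 0.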

Data Explorer

For this next part, we are doing a DataExplorer report of the Movies database.

# We first install and call the DataExplorer library

#install.packages("DataExplorer")
library(DataExplorer)

# Now we can create a report using the library and the database

create_report(moviespersonal)

This function gives a general overview of the dataset: basic statistics, data structures, histograms, etc. The overview is displayed in HTML format in a browser tab.

As shown in the principal component analysis section of the report, there are 43,050 categories in the original title column. This is fewer than the number of rows, which means some titles are repeated; that should not happen unless the dataset contains duplicate entries or different movies that share a title.

The same section shows only 45,089 categories in the IMDb id column, which may indicate that some movies are not in the Internet Movie Database (IMDb). These movies may lack information such as the average score (vote_average) and the number of reviews (vote_count).

Some movies are also missing an overview (synopsis). This column is unlikely to be efficient to analyze in the long run, since it would mean reading each overview individually, and with more than 40 thousand movies that task is extremely difficult.

Problem Setup 3

Analyzing missing values

In our DataExplorer report we found that runtime is the only column that still has missing values that can actually be corrected. We will now analyze the data and use the mice package to fill in the missing values.

summary(moviespersonal$runtime)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   85.00   95.00   93.78  107.00 1256.00     251

We see that we have 251 missing values. Now we visualize the data.

Vis_miss

#We install the library
#install.packages("visdat")
library(visdat)
vis_miss(moviespersonal, warn_large_data = FALSE)

It seems that there is not a lot of data missing. Release date cannot sensibly be imputed, so we will work on runtime, which has about 1% of its data missing.

moviespersonal <- moviespersonal %>%
  mutate(missing_inf = is.na(runtime))

Now we analyze whether the missing values are related to popularity.

moviespersonal %>%
  group_by(missing_inf) %>%
  summarize(avg_popularity = mean(popularity))
## # A tibble: 2 × 2
##   missing_inf avg_popularity
##   <lgl>                <dbl>
## 1 FALSE                 2.93
## 2 TRUE                 NA

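The TRUE group comes back as NA because popularity itself has missing values in those rows, so this table alone cannot tell us anything; passing na.rm = TRUE to mean() would give a usable number. The sorted missingness plot below is what suggests that less popular movies are the ones missing runtime values.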
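
A minimal sketch of the grouped mean with missing values skipped, using base R and hypothetical toy data (the `toy` values are assumptions). With dplyr, the equivalent change would be `mean(popularity, na.rm = TRUE)` inside `summarize()`.

```r
# Hypothetical toy data with the same two columns.
toy <- data.frame(
  runtime    = c(90, NA, 120, NA, 100),
  popularity = c(5.0, 0.4, 8.0, NA, 2.0)
)
toy$missing_inf <- is.na(toy$runtime)

# Without na.rm, one missing popularity turns the whole group mean into NA.
tapply(toy$popularity, toy$missing_inf, mean)

# With na.rm = TRUE the missing popularity is skipped.
avg_popularity <- tapply(toy$popularity, toy$missing_inf, mean, na.rm = TRUE)
avg_popularity
```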

moviespersonal %>%
  arrange(popularity) %>%
  vis_miss(warn_large_data = FALSE)

Mice Package

Now we use the package mice

#install.packages("mice")
library(mice)
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
method_list <- rep("", ncol(moviespersonal))

# Assign the names of the columns from the moviespersonal dataset to the method_list
names(method_list) <- colnames(moviespersonal)

# Set the method for the 'runtime' column to "pmm" (predictive mean matching)
method_list["runtime"] <- "pmm"

# Now run the mice function with the specified method_list
#accounts_mice <- mice(moviespersonal, m=1, maxit=50, meth="pmm", seed=500)

In our attempt, mice could not fill in these missing values, which we attribute to how few of them there are, so we use the column mean instead.

# Calculate the mean of the runtime column, excluding NA values
runtime_mean <- mean(moviespersonal$runtime, na.rm = TRUE)

# Replace NA values in the runtime column with the mean
moviespersonal <- moviespersonal %>%
  mutate(runtime = ifelse(is.na(runtime), runtime_mean, runtime))

moviespersonal %>%
  arrange(popularity) %>%
  vis_miss(warn_large_data = FALSE)

Interpretation of the Report

In this project, we began by exploring the data, cleaning it, and dealing with null values and duplicates. Cleaning the data is the longest and most important part of the work, because it determines the quality of everything we analyze afterwards. We also identified the incorrect data types and the malformed rows and removed the anomalies. The genre data received special attention: we converted it from a JSON-like string format to a more readable form, which is crucial for any further analysis.

We also handled the zero values in the revenue and budget columns by replacing them with the mean of the non-zero values, and made sure that no NA values were left untreated. This step is intended to prevent the analysis from being biased by missing or unreported financial data.

An important part of the process is ensuring data uniqueness by removing duplicate movie titles, which is crucial for accurate movie counts in the analysis.

In summary, the work follows a systematic approach to preparing the dataset for statistical analysis and possible model construction. However, the method we chose for missing and zero values should be examined further to confirm it is the most appropriate one, as it may affect the integrity of the analysis, especially if the zeros have real meaning within the dataset. After this initial data preparation, we are ready for exploratory data analysis to uncover insights and inform any hypotheses or models we plan to build.

Problem setup 4

table1 <- table(moviespersonal$status, moviespersonal$adult)
prop.table(table1)*100
##                  
##                          False         True
##                    0.182378020  0.000000000
##   Canceled         0.004737091  0.000000000
##   In Production    0.042633823  0.002368546
##   Planned          0.033159640  0.000000000
##   Post Production  0.215537660  0.000000000
##   Released        99.005210801  0.018948366
##   Rumored          0.495026054  0.000000000
table2 <- table(moviespersonal$video, moviespersonal$adult)
prop.table(table2)*100
##        
##                False         True
##          0.002368546  0.000000000
##   False 99.760776883  0.021316911
##   True   0.215537660  0.000000000
library(ggplot2)
ggplot(moviespersonal, aes(x = adult, fill = status)) + geom_bar() + facet_wrap(~adult)

Also, for this particular part of the setup, we will be using the Esquisse package, which lets us build ggplot2 graphics interactively and, in all fairness, makes prettier graphics easier to produce than writing ggplot2 code by hand.

It is important to note that, rather than being called like a library in the script, Esquisse is used by interacting with the Addins menu at the top of the screen.

# We add a column with just the release year, which can be useful when working with Esquisse

moviespersonal$year <- substr(moviespersonal$release_date, 1, 4)

# It is important to note that the following graph was made using esquisse, as we used the function of copying the code onto the script.

# This graph shows us the movies between 1950 and 2023, comparing the average score they got on IMDb and how much money they made
library(dplyr)
library(ggplot2)

moviespersonal %>%
 filter(release_date >= "1950-05-03" & release_date <= "2023-04-01") %>%
 ggplot() +
 aes(x = vote_average, y = revenue) +
 geom_point(shape = "circle", size = 1.5, 
 colour = "#0C4C8A") +
 theme_classic()

# This graph shows the movies between 1950 and 2023, comparing budget and revenue. It tells us that a movie with a high production budget will not necessarily have significant returns, or even break even

library(dplyr)
library(ggplot2)

moviespersonal %>%
 filter(release_date >= "1950-05-04" & release_date <= "2023-04-01") %>%
 ggplot() +
 aes(x = budget, y = revenue) +
 geom_point(shape = "circle", size = 1.5, colour = "#112446") +
 labs(caption = "For reference, 2e+08 equals 200 million") +
 theme_minimal()

Problem setup 5

# With this particular line of code, we are calculating the profit of each movie
moviespersonal <- moviespersonal %>%
  mutate(profit= revenue - budget)

Vote Average/Genre vs. Revenue

# With this code, we compute means by group: the mean revenue for each average score on the Internet Movie Database (IMDb). The intention is to show that a high score online does not necessarily mean a movie performs well at the box office.

# This line of code helps us with stopping the numbers from appearing in scientific notation
options(scipen = 999)

# Now we code the mean revenue based on the average score on IMDb
mean_revenue <- moviespersonal %>%
  group_by(vote_average) %>%
  summarize(mean_revenue = mean(revenue, na.rm = TRUE)) %>%
  arrange(desc(mean_revenue))

# With this we print the function
print(mean_revenue)
## # A tibble: 92 × 2
##    vote_average mean_revenue
##           <dbl>        <dbl>
##  1          7.9    87140789.
##  2          8.1    87125416.
##  3          9.1    84393695.
##  4          7.6    82704986.
##  5          8.2    79617362.
##  6          8.3    78486765.
##  7          7.5    77187057.
##  8          7.4    75553881.
##  9          7.7    73748020.
## 10          6.7    72205531.
## # ℹ 82 more rows
# With this code we find the mean revenue per genre. Since a movie can have more than one genre, we first separate the genres into rows and then take the mean revenue of the movies that include each genre.

# Create a new column with separated genres into new rows, keeping the original 'genre_names' column unchanged
movies_with_separated_genres <- moviespersonal %>%
  mutate(separated_genres = genre_names) %>%
  separate_rows(separated_genres, sep = " ") %>%
  select(genre_names, separated_genres, revenue, profit)

# Get unique genres from the new 'separated_genres' column
unique_genres <- movies_with_separated_genres %>%
  select(separated_genres) %>%
  distinct() %>%
  pull(separated_genres)

# Initialize a dataframe to store the average revenue for each unique genre
mean_revenue_by_genre <- data.frame(genre = character(), mean_revenue = numeric(), stringsAsFactors = FALSE)

# Iterate over each unique genre to calculate the average revenue
for(genre in unique_genres) {
  mean_revenue <- movies_with_separated_genres %>%
    # Select rows where the new 'separated_genres' column contains the current genre
    filter(str_detect(separated_genres, fixed(genre))) %>%
    # Calculate the average revenue for the selected genre
    summarize(mean_revenue = mean(revenue, na.rm = TRUE)) %>%
    # Extract the average revenue value
    pull(mean_revenue)
  
  # Append the genre and its average revenue to the dataframe
  mean_revenue_by_genre <- rbind(mean_revenue_by_genre, data.frame(genre = genre, mean_revenue = mean_revenue))
}

# Display the result
print(mean_revenue_by_genre)
##          genre mean_revenue
## 1    Animation     89334110
## 2       Comedy     67805376
## 3       Family     89513534
## 4    Adventure    103618594
## 5      Fantasy     94402872
## 6      Romance     64738395
## 7        Drama     63925945
## 8       Action     81127887
## 9        Crime     66435133
## 10    Thriller     68885257
## 11      Horror     64464103
## 12     History     64717137
## 13     Science     83469618
## 14     Fiction     83469618
## 15     Mystery     67625498
## 16         War     68262127
## 17     Foreign     65258394
## 18                      NaN
## 19       Music     65440574
## 20 Documentary     65198337
## 21     Western     65258973
## 22          TV     68747587
## 23       Movie     68747587
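
The loop above can also be written as a single grouped summary. The sketch below uses base R on hypothetical toy data and assumes the genres are joined with "|" rather than a space (an assumption, not how genre_names is built above), which avoids splitting "Science Fiction" and "TV Movie" into two rows each:

```r
# Hypothetical toy data; genres are "|"-separated so that multi-word genres
# such as "Science Fiction" stay in one piece.
toy <- data.frame(
  genre_names = c("Science Fiction|Action", "Action", "TV Movie", ""),
  revenue     = c(100, 50, 10, 5),
  stringsAsFactors = FALSE
)

# One row per (movie, genre) pair; movies without a genre contribute no rows.
genre_list <- strsplit(toy$genre_names, "|", fixed = TRUE)
long <- data.frame(
  genre   = unlist(genre_list),
  revenue = rep(toy$revenue, lengths(genre_list)),
  stringsAsFactors = FALSE
)

# Mean revenue per genre in a single call, ordered from highest to lowest.
means <- tapply(long$revenue, long$genre, mean)
mean_revenue_by_genre <- means[order(means, decreasing = TRUE)]
mean_revenue_by_genre
```

With dplyr, the same idea would be `group_by(separated_genres)` followed by `summarize()`, as the report already does for profit further down.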

Statistical observations

library("e1071")
mean(moviespersonal$profit)
## [1] 47103572
var(moviespersonal$profit)
## [1] 2482584026357099
sd(moviespersonal$profit)
## [1] 49825536
median(moviespersonal$profit)
## [1] 47183112
quantile(moviespersonal$profit)
##         0%        25%        50%        75%       100% 
## -111007242   47183112   47183112   47183112 2550965087
min(moviespersonal$profit)
## [1] -111007242
max(moviespersonal$profit)
## [1] 2550965087
range(moviespersonal$profit)
## [1] -111007242 2550965087
skewness(moviespersonal$profit)
## [1] 13.28206

We now know that the distribution is strongly skewed to the right. With this data, we can compute the interval of one standard deviation around the mean and, later, the correlation coefficients.

profit_mean <-mean(moviespersonal$profit)
profit_sd <-sd(moviespersonal$profit) 
profit_dev <- c(profit_mean - profit_sd, profit_mean + profit_sd)
profit_dev
## [1] -2721964 96929108

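The mean profit plus and minus one standard deviation is saved in the object “profit_dev”. Note that this describes the spread of individual movies; it is not a confidence interval for the mean.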
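
For comparison, a confidence interval for the mean uses the standard error rather than the standard deviation. The following is a minimal sketch on a hypothetical `profit` vector; it relies on the usual t-based formula, which is only approximate for data as skewed as the real profit column.

```r
# Hypothetical profit values; in the report this would be moviespersonal$profit.
profit <- c(12, 15, 9, 20, 14, 11, 18, 13, 16, 12)

n           <- length(profit)
profit_mean <- mean(profit)
profit_sd   <- sd(profit)

# Mean +/- one standard deviation: describes the spread of individual movies.
profit_dev <- c(profit_mean - profit_sd, profit_mean + profit_sd)

# 95% confidence interval for the mean: uses the standard error sd / sqrt(n)
# and a t quantile with n - 1 degrees of freedom.
std_error <- profit_sd / sqrt(n)
profit_ci <- profit_mean + c(-1, 1) * qt(0.975, df = n - 1) * std_error

profit_dev
profit_ci
```

The same interval is returned by `t.test(profit)$conf.int`, which can serve as a cross-check.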

#Coefficient of variation
CV <- sd(moviespersonal$profit)/mean(moviespersonal$profit)
CV
## [1] 1.057787
#Trimmed mean
mean(moviespersonal$profit, trim=.1)
## [1] 46343198
#Z scores 
#"scale" automatically gives you the zscores
#moviespersonal <- moviespersonal %>%
  #mutate(zscore = scale(moviespersonal$profit))
#moviespersonal %>% 
  #filter(zscore < 3)
#We removed all rows with more than 3 zscores (438 rows)
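
The commented-out z-score filter can be sketched as follows on a hypothetical `profit` vector. This version filters on the absolute z-score, which is a change from the `zscore < 3` condition in the comments above: that condition would keep extreme negative values.

```r
# Hypothetical profits with one extreme value.
profit <- c(rep(10, 10), rep(12, 10), 500)

# scale() centres and divides by the standard deviation; as.numeric() drops
# the one-column matrix structure that scale() returns.
zscore <- as.numeric(scale(profit))

# Keep rows within 3 standard deviations on either side of the mean.
kept <- profit[abs(zscore) < 3]
length(profit) - length(kept)  # number of rows dropped
```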

Analyzing statistical observations

Now, we will add a new column called words_title which will tell us how many words are in the title. This column will then be grouped and compared with the profit per each different word count

moviespersonal$words_title <- str_count(moviespersonal$original_title, "\\w+")
moviespersonal %>%
  group_by(words_title) %>%
  summarize(mean_profit = mean(profit),
            sd_profit = sd(profit),
            median_profit = median(profit),
            quantile_profit = quantile(profit, 0.90),
            count = n()) %>%
  arrange(desc(mean_profit))
## # A tibble: 20 × 6
##    words_title mean_profit  sd_profit median_profit quantile_profit count
##          <int>       <dbl>      <dbl>         <dbl>           <dbl> <int>
##  1          10   64071620. 117714391.     47183112.       49509867.   127
##  2           8   57643350. 102907506.     47183112.       47183112.   510
##  3           7   54891330.  78345824.     47183112.       52947905    896
##  4          11   54448920.  62437300.     47183112.       60466534.    73
##  5          16   53851205.  11549475.     47183112.       63186534.     3
##  6          12   51830574.  33066573.     47183112.       47183112.    41
##  7           9   50407416.  49948258.     47183112.       47183112.   245
##  8          14   48806518.   5853273.     47183112.       47183112.    13
##  9           5   48545257.  56870932.     47183112.       47183112.  3575
## 10           6   48448381.  51368166.     47183112.       47183112.  1792
## 11           4   47456572.  44430688.     47183112.       47183112.  6254
## 12          15   47183112.         0      47183112.       47183112.     9
## 13          17   47183112.         0      47183112.       47183112.     2
## 14          18   47183112.         0      47183112.       47183112.     2
## 15          19   47183112.         0      47183112.       47183112.     3
## 16           3   46681369.  42229807.     47183112.       48787390.  9280
## 17           1   45982691.  54139116.     47183112.       60787390.  8056
## 18           2   45970169.  44364092.     47183112.       60927390. 11314
## 19          13   44236877.  10057832.     47183112.       47183112.    24
## 20          20   24350000         NA      24350000        24350000      1

Now we will group by year to see which years were the most profitable for cinema

moviespersonal %>%
  group_by(year) %>%
  summarize(mean_profit = mean(profit),
            sd_profit = sd(profit),
            median_profit = median(profit),
            quantile_profit = quantile(profit, 0.90),
            count = n()) %>%
  arrange(desc(mean_profit))
## # A tibble: 136 × 6
##    year  mean_profit sd_profit median_profit quantile_profit count
##    <chr>       <dbl>     <dbl>         <dbl>           <dbl> <int>
##  1 1902    68781405.       NA      68781405.       68781405.     1
##  2 2017    56221083. 81988788.     47183112.       64627390.   469
##  3 1905    51503961.  9661710.     47183112.       60145657.     5
##  4 2016    51400780. 69440583.     47183112.       63787390.  1442
##  5 2015    50584676. 81454255.     47183112.       58487390.  1744
##  6 1977    50302191. 46357949.     47183112.       47183112.   312
##  7 1904    50268366.  8162815.     47183112.       55821823.     7
##  8 2018    49824823.  4193441.     47183112.       54387390.     5
##  9 2012    49425227. 60546406.     47183112.       65787390.  1589
## 10 2011    49410075. 59204531.     47183112.       66987390.  1535
## # ℹ 126 more rows

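The year with the highest mean profit was 1902, then 2017, then 1905. Note that 1902 and 1905 contain only 1 and 5 movies respectively, so their means are not very reliable.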

ggplot(moviespersonal, aes(x = year, y = profit)) + geom_density()+ theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Analyzing correlation between profit and other variables

We will continue using the word count vs profit to find the correlation between them (if any)

ggplot(moviespersonal, aes(y = profit, x = words_title)) + geom_point() + geom_smooth(method = "lm", se = TRUE)
## `geom_smooth()` using formula = 'y ~ x'

ggplot(moviespersonal, aes(x = cut(words_title, breaks = 5), y = profit)) + geom_boxplot() + geom_smooth(method = "lm", se = TRUE)
## `geom_smooth()` using formula = 'y ~ x'

Now, we compute the Pearson coefficient of correlation

moviespersonal %>% 
  summarize(N = n(), r = cor(profit, words_title))
##       N          r
## 1 42220 0.03440706

The correlation between the word count of the title and the movie’s profit is extremely low, practically null: r = 0.03.

We will do this with the budget and the popularity to see if there are any positive correlations.

ggplot(moviespersonal, aes(y = profit, x = budget)) + geom_point() + geom_smooth(method = "lm", se = TRUE)
## `geom_smooth()` using formula = 'y ~ x'

moviespersonal %>% 
  summarize(N = n(), r = cor(budget,profit))
##       N         r
## 1 42220 0.4841505

Although still not a very strong correlation, budget and profit have a moderate correlation of 0.48.

ggplot(moviespersonal, aes(y = profit, x = revenue)) + geom_point() + geom_smooth(method = "lm", se = TRUE)
## `geom_smooth()` using formula = 'y ~ x'

moviespersonal %>% 
  summarize(N = n(), r = cor(revenue, profit))
##       N         r
## 1 42220 0.9745676
ggplot(moviespersonal, aes(y = profit, x = popularity)) + geom_point() + geom_smooth(method = "lm", se = TRUE)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

moviespersonal %>% 
  summarize(N = n(), r = cor(popularity, profit))
##       N  r
## 1 42220 NA

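The coefficient comes back as NA because popularity contains a missing value (the plot warnings above report one removed row), and cor() returns NA by default when any value is missing; passing use = "complete.obs" would skip that row. For now, we will try with another variable.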
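
A minimal sketch of how `cor()` behaves with a missing value, on hypothetical toy vectors (the numbers are assumptions, not values from the dataset):

```r
# Hypothetical toy vectors with one missing popularity value.
popularity <- c(1.2, 3.4, NA, 7.8, 5.1, 2.2)
profit     <- c(10, 30, 25, 80, 55, 18)

# Default behaviour: any NA in either vector makes the result NA.
cor(popularity, profit)

# use = "complete.obs" drops the pairs that contain an NA first.
r <- cor(popularity, profit, use = "complete.obs")
r

# Squaring r gives the share of variance explained (r^2), which is a
# different number from r itself.
r^2
```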

moviespersonal <- moviespersonal %>% filter(runtime <= 250)
ggplot(moviespersonal, aes(y = profit, x = runtime)) + geom_point() + geom_smooth(method = "lm", se = TRUE)
## `geom_smooth()` using formula = 'y ~ x'

moviespersonal %>% 
  summarize(N = n(), r = cor(popularity, runtime))
##       N  r
## 1 42067 NA

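To get rid of movies that are longer than about 4 hours in runtime, we used the filter function. Again, we cannot see a value for the coefficient: the call above correlates popularity with runtime rather than runtime with profit, and popularity still contains a missing value. Judging from the fitted line and the points, the relationship between runtime and profit appears positive, but we have not quantified its strength.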

ggplot(movies_with_separated_genres, aes(x = separated_genres, y = revenue)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + # Rotate the x-axis labels for better readability
  labs(x = "Genre", y = "Revenue", title = "Boxplot of Revenue by Genre")

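We can see that the revenues per genre are widely spread, which reflects the large number of movies in each group. We can also notice that Foreign, Movie and TV have really low revenue, and that Action, Adventure, Fantasy, Fiction and Science have extreme outliers, since the movie Avatar has these genres and has the highest revenue in the dataset. Note that “Science”/“Fiction” and “TV”/“Movie” are the two halves of “Science Fiction” and “TV Movie”, which were split on the space, so each pair shows identical values.

Now we analyze the new column that we created, named profit.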

movies_with_separated_genres %>%
  group_by(separated_genres) %>%
  summarize(
    mean_profit = mean(profit, na.rm = TRUE),
    sd_profit = sd(profit, na.rm = TRUE),
    median_profit = median(profit, na.rm = TRUE),
    quantile_90 = quantile(profit, 0.90, na.rm = TRUE),
    count = n()
  ) %>%
  arrange(desc(mean_profit))
## # A tibble: 23 × 6
##    separated_genres mean_profit  sd_profit median_profit quantile_90 count
##    <chr>                  <dbl>      <dbl>         <dbl>       <dbl> <int>
##  1 "Adventure"        71428203. 134477234.     47183112.   99600000   3231
##  2 "Fantasy"          65023884. 124170024.     47183112.   68739390.  2119
##  3 "Animation"        62345097.  95348486.     47183112.   67987390.  1846
##  4 "Family"           61621163.  97836888.     47183112.   67507390.  2590
##  5 "Fiction"          57536350. 104793345.     47183112.   68193990.  2829
##  6 "Science"          57536350. 104793345.     47183112.   68193990.  2829
##  7 "Action"           54819226.  91432796.     47183112.   66787390.  6119
##  8 "Movie"            47984308.   3848747.     47183112.   47183112.   673
##  9 "TV"               47984308.   3848747.     47183112.   47183112.   673
## 10 ""                 47093271.   6241410.     47183112.   47183112.  2280
## # ℹ 13 more rows

Skewness and Log Transformation Analysis

## Skewness of revenue: 12.78445
## Skewness of budget: 7.167064

Distribution of original and transformed Budget

Original Budget

Figure: Density of Original Budget

The original budget distribution is highly right-skewed, indicating that most movies have a low budget, while a few movies have a significantly higher budget. The long tail to the right shows that there are outliers with exceptionally large budgets compared to the rest.

Transformed Budget

Figure: Density of Log-Transformed Budget

The log transformation has normalized the distribution of budget values, as evidenced by the peak around 15 to 17 on the log_budget scale. This suggests that a majority of movies have a budget within a mid-range when the scale is logarithmic, reducing the influence of extreme values.

Distribution of original and transformed Revenue

Original Revenue

Figure: Density of Original Revenue

The density plot for the original revenue values demonstrates extreme right skewness, where the majority of movies earn revenue in the lower range, while very few have extraordinarily high revenues, creating a long right tail. This skewness could be due to blockbuster hits which are rare but generate a massive amount of revenue.

Transformed Revenue

Figure: Density of Log-Transformed Revenue

After the logarithmic transformation, the revenue data shows a more bell-shaped distribution, though there is still a slight right skew. This indicates a more normalized distribution, but with a few outliers with exceptionally high revenue, visible from the tail extending to the right.
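
The code behind the skewness values and the log transformation is not shown in this section. The sketch below illustrates the idea on a hypothetical `budget` vector; the `skew()` helper and the use of `log1p()` are assumptions for illustration, not the report's own code.

```r
# Hypothetical right-skewed budgets (a few very large values).
budget <- c(1e5, 2e5, 5e5, 1e6, 2e6, 5e6, 1e7, 5e7, 2e8)

# Hypothetical helper: moment-based skewness, mean((x - mean)^3) / sd^3,
# with the population standard deviation in the denominator.
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# log1p(x) computes log(1 + x), so a zero budget maps to 0 instead of -Inf.
log_budget <- log1p(budget)

skew(budget)      # strongly positive: long right tail
skew(log_budget)  # much closer to 0
```

A plain `log()` gives nearly the same result for values this large; `log1p()` only matters when zeros are present.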

Summarized data

We now show the summarized, log-transformed data to see the central tendency and variability of Revenue and Budget. Both IQRs come out as 0 because more than half of the rows hold the same mean-imputed value.

##   mean_log_revenue median_log_revenue iqr_log_revenue mean_log_budget
## 1         17.70161           18.04653               0        16.58778
##   median_log_budget iqr_log_budget
## 1           16.8884              0
mean_revenue <- mean(moviespersonal$revenue, na.rm = TRUE)
sd_revenue <- sd(moviespersonal$revenue, na.rm = TRUE)

n10 <- rnorm(10, mean = mean_revenue, sd = sd_revenue)
n100 <- rnorm(100, mean = mean_revenue, sd = sd_revenue)
n1000 <- rnorm(1000, mean = mean_revenue, sd = sd_revenue)
n10000 <- rnorm(10000, mean = mean_revenue, sd = sd_revenue)

hist(n10, breaks = 5, main = "Histogram of n10")

hist(n100, breaks = 30, main = "Histogram of n100")

hist(n1000, breaks = 40, main = "Histogram of n1000")

hist(n10000, breaks = 100, main = "Histogram of n10000")

value_to_evaluate <- 5
probability <- pnorm(value_to_evaluate, mean = mean_revenue, sd = sd_revenue, lower.tail = TRUE)
print(probability)
## [1] 0.1210488