The movies dataset from Kaggle provides a long list of movies from many different countries. Besides the movie titles, the dataset includes the runtime, original language, status, tagline, and other fields, along with our two most important variables: revenue and budget. These are critical because they let us create (mutate) a new column called profit. With profit defined, we can begin to analyze our problem setup question:
What makes a movie successful or unsuccessful?
We analyzed the data using statistical tools such as Pearson's correlation coefficient (r) to compare several variables against profit.
1. Words in the title - Profit, r = 0.03, “Extremely low/null”
2. Budget - Profit, r = 0.48, “Moderate”
3. Revenue - Profit, r = 0.97, “Very strong”
4. Popularity - Profit, r = NA (could not be computed because of missing values)
5. Run time (<250 minutes) - Profit, r = 0.04, “Extremely low/null”
The years with the highest mean profit:
1. 1902
2. 2017
3. 1905,
while the lowest was 1988.
The genres with the highest mean profit:
1. Adventure
2. Fantasy
3. Animation,
while the lowest was History.
Vote averages with the highest mean revenue:
1. 7.9
2. 8.1
3. 9.1
Interestingly, a vote average of 10.0 is in 43rd place.
Word counts in titles with the highest mean profit:
1. 10
2. 8
3. 7
This evidence focuses on the analysis of the Kaggle movie dataset, which provides the variables we will be working with, such as the duration of the movies, their revenue, and their budgets. Our main objective is to understand why films fail or become commercially successful. We measure this with the profit variable, calculated as the difference between revenue and budget. To achieve this, we apply data cleaning and preparation techniques, followed by extensive analysis. We also examine trends by genre and year to recognize patterns of success and to understand which elements contribute to the commercial success of films.
Create the profit variable to determine the profit of each film.
Analyze genres and years to determine which were the most profitable.
Clean and prepare the data, handling missing values and errors, to make sure the data is in the right format for analysis.
Run an in-depth analysis to recognize patterns, anomalies, and trends in the data.
To answer the question of what makes a movie successful or unsuccessful, we first cleaned the data using functions like:
- filter
- as.factor/character/numeric etc.
- vis_miss
- group_by
- mutate
First, we read the CSV provided for the problem setup.
moviespersonal <- read.csv("C:\\Users\\Usuario\\Downloads\\Movies dataset\\movies_metadata.csv")
We call the packages we plan to use to clean the data provided.
#install.packages("jsonlite")
library("dplyr", quietly = TRUE)
library("stringr", quietly = TRUE)
# We also install the packages we will use to create graphs later
#install.packages("esquisse")
#install.packages("ggplot2")
library("ggplot2", quietly = TRUE)
#install.packages("tidyverse")
library("tidyverse", quietly = TRUE)
We run a summary on the dataset to see if we can identify anything that we need to “clean”
summary(moviespersonal)
## adult belongs_to_collection budget genres
## Length:45466 Length:45466 Length:45466 Length:45466
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## homepage id imdb_id original_language
## Length:45466 Length:45466 Length:45466 Length:45466
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## original_title overview popularity poster_path
## Length:45466 Length:45466 Length:45466 Length:45466
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## production_companies production_countries release_date
## Length:45466 Length:45466 Length:45466
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## revenue runtime spoken_languages status
## Min. :0.000e+00 Min. : 0.00 Length:45466 Length:45466
## 1st Qu.:0.000e+00 1st Qu.: 85.00 Class :character Class :character
## Median :0.000e+00 Median : 95.00 Mode :character Mode :character
## Mean :1.121e+07 Mean : 94.13
## 3rd Qu.:0.000e+00 3rd Qu.: 107.00
## Max. :2.788e+09 Max. :1256.00
## NA's :6 NA's :263
## tagline title video vote_average
## Length:45466 Length:45466 Length:45466 Min. : 0.000
## Class :character Class :character Class :character 1st Qu.: 5.000
## Mode :character Mode :character Mode :character Median : 6.000
## Mean : 5.618
## 3rd Qu.: 6.800
## Max. :10.000
## NA's :6
## vote_count
## Min. : 0.0
## 1st Qu.: 3.0
## Median : 10.0
## Mean : 109.9
## 3rd Qu.: 34.0
## Max. :14075.0
## NA's :6
At first glance, we see several columns stored with the wrong data class.
adult, chr -> factor (True/False)
budget, chr -> numeric/integer
id, chr -> integer
popularity, chr -> numeric
release_date, chr -> date
status, chr -> factor (Released, Rumored, etc.)
revenue and runtime are already numeric in the summary above; we convert them anyway as a safeguard.
We need to change these into their correct class to avoid problems in the analysis.
Furthermore, some rows are corrupted and would prevent a correct analysis.
Rows 19731, 29504, and 35588 have errors in every single column. Since we have 45,466 rows of data, removing these 3 rows will not affect our findings.
moviespersonal <- moviespersonal[-c(19731, 29504, 35588), ]
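The text does not show how the three corrupted rows were located. One possible way, assuming corrupted rows show up as non-numeric text in a column that should be numeric, is sketched below. The toy data frame and the names `raw`, `bad_rows`, and `clean` are hypothetical and are not part of the original analysis.

```r
# Toy data frame standing in for the raw CSV (hypothetical values).
raw <- data.frame(
  title  = c("A", "B", "C", "D"),
  budget = c("30000000", "/poster.jpg", "0", "65000000"),
  stringsAsFactors = FALSE
)

# Coerce to numeric; entries that are not numbers become NA.
budget_num <- suppressWarnings(as.numeric(raw$budget))

# Rows where text was present but could not be parsed as a number.
bad_rows <- which(is.na(budget_num) & !is.na(raw$budget))
bad_rows

# Drop those rows only if any were found,
# because raw[-integer(0), ] would return zero rows.
clean <- if (length(bad_rows) > 0) raw[-bad_rows, ] else raw
clean
```

Inspecting the flagged rows before deleting them is still advisable, since a parse failure alone does not prove that the whole row is corrupted.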
Once this is done, we can finally turn the columns into their correct data class.
moviespersonal$adult <- as.factor(moviespersonal$adult)
moviespersonal$budget <- as.numeric(moviespersonal$budget)
moviespersonal$id <- as.integer(moviespersonal$id)
moviespersonal$popularity <- as.numeric(moviespersonal$popularity)
moviespersonal$release_date <- as.Date(moviespersonal$release_date)
moviespersonal$revenue <- as.numeric(moviespersonal$revenue)
moviespersonal$runtime <- as.numeric(moviespersonal$runtime)
moviespersonal$status <- as.factor(moviespersonal$status)
Now that the columns are in their correct data type/class, we can begin to change the way the data is shown.
For example, the column “genres” contains not only the genres of the movie but also a lot of JSON-like punctuation: “[{‘id’: 28, ‘name’: ‘Action’}, {‘id’: 12, ‘name’: ‘Adventure’},
The punctuation makes the data messy and hard to read.
We now create a new column that shows only the genres.
## [1] "'name': 'Animation', 'name': 'Comedy', 'name': 'Family'"
## [2] "'name': 'Adventure', 'name': 'Fantasy', 'name': 'Family'"
## [3] "'name': 'Romance', 'name': 'Comedy'"
## [4] "'name': 'Comedy', 'name': 'Drama', 'name': 'Romance'"
## [5] "'name': 'Comedy'"
## [6] "'name': 'Action', 'name': 'Crime', 'name': 'Drama', 'name': 'Thriller'"
We can now see the genres in a cleaner format, but each one is still preceded by the text 'name'. To get rid of this unnecessary text we run the following code.
## [1] "Animation Comedy Family" "Adventure Fantasy Family"
## [3] "Romance Comedy" "Comedy Drama Romance"
## [5] "Comedy" "Action Crime Drama Thriller"
Now we can clearly read the genres of each movie in a new column called genre_names. They are still stored as chr, so we convert them to factor to properly analyze the data.
## (Other) Drama Comedy Documentary
## 13695 5000 3621 2723
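The code for the genre cleaning is not shown in the report. As a hedged illustration only, the extraction could be done in a single step with stringr, assuming the raw strings look like the example quoted earlier. The vector `genres_raw`, the id values in it, and the comma separator are assumptions made for this sketch.

```r
library(stringr)

# Toy strings in the same JSON-like shape as the genres column (illustrative values).
genres_raw <- c(
  "[{'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'Adventure'}]",
  "[{'id': 878, 'name': 'Science Fiction'}]",
  "[]"
)

# Capture the text between the quotes that follow 'name':
matches <- str_match_all(genres_raw, "'name': '([^']+)'")

# Column 2 of each match matrix holds the captured genre name.
genre_names <- vapply(
  matches,
  function(m) paste(m[, 2], collapse = ", "),
  character(1)
)
genre_names

# Convert to factor, as described in the text.
genre_factor <- as.factor(genre_names)
```

Joining the names with a comma rather than a space keeps two-word genres such as Science Fiction intact if the column is split again later.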
Reading the values in the data frame, we notice that revenue contains many zeros, probably because the revenue could not be found or because of some other recording error. These zeros distort the statistics: they pull the mean down and affect the analysis of the movies.
To solve this, we replace the zero values with the mean of the non-zero revenues.
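# Mean of the positive revenues; na.rm = TRUE keeps missing values from turning the mean into NA
mean_revenue <- mean(moviespersonal$revenue[moviespersonal$revenue > 0], na.rm = TRUE)
# Replace the zero values with the mean revenue (which() skips NA entries)
moviespersonal$revenue[which(moviespersonal$revenue == 0)] <- mean_revenue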
Now we also replace NA values with the column’s mean
# Check if there are any NA values in the revenue column
any_na <- any(is.na(moviespersonal$revenue))
# If there are NA values, replace them with the mean revenue (excluding NA and zero values)
if(any_na) {
# Recalculate the mean in case the zero replacement affected it, excluding NA values
mean_revenue <- mean(moviespersonal$revenue[moviespersonal$revenue > 0], na.rm = TRUE)
# Replace NA values with the new mean revenue
moviespersonal$revenue[is.na(moviespersonal$revenue)] <- mean_revenue
}
We repeat the same process now on the column Budget.
# Calculate the budget's mean value excluding NA's and zero's
mean_budget <- mean(moviespersonal$budget[moviespersonal$budget > 0], na.rm = TRUE)
# Replace zero's with the budget's mean value
moviespersonal$budget[moviespersonal$budget == 0] <- mean_budget
# Replace NA's with the budget's mean value
moviespersonal$budget[is.na(moviespersonal$budget)] <- mean_budget
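Since the same replacement is applied to both revenue and budget, the logic can be wrapped in a small helper. This is a minimal sketch on toy values; the function name `impute_with_positive_mean` is hypothetical and does not appear in the original code.

```r
# Replace zeros and NAs in a numeric vector with the mean of its positive values.
impute_with_positive_mean <- function(x) {
  positive_mean <- mean(x[!is.na(x) & x > 0])
  x[is.na(x) | x == 0] <- positive_mean
  x
}

revenue <- c(0, 100, NA, 300, 0)  # toy values
imputed <- impute_with_positive_mean(revenue)
imputed

# Share of values that had to be imputed.
imputed_share <- mean(is.na(revenue) | revenue == 0)
imputed_share
```

Checking the imputed share is worthwhile: when most values are replaced by one constant, medians and quartiles of the column (and of anything derived from it, such as profit) collapse onto that constant.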
For this next part, we are doing a DataExplorer report of the Movies database.
# We first install and call the DataExplorer library
#install.packages("DataExplorer")
library(DataExplorer)
# Now we can create a report using the library and the database
create_report(moviespersonal)
This function gives a general overview of the data contained in the database: basic statistics of the values, data structures, histograms, etc. The overview is displayed in HTML format in a browser tab.
As shown in the report, there are 43050 categories in the original title column. With more than 45,000 rows, this means some titles are repeated, which should not happen if every movie appeared only once in the database.
The report also shows only 45089 categories in the IMDb id column, which may indicate that some movies are not in the Internet Movie Database (IMDb). These movies may therefore lack information such as the average score (vote_average) and the number of reviews (vote_count).
Some movies are also lacking an overview (or synopsis). This column, however, may not be efficient to analyze in the long run, since it would mean analyzing each overview individually, and with more than 40 thousand entries this task would be extremely difficult.
In our DataExplorer report we found that runtime is the only column that still has missing values that can actually be corrected. We will now analyze the data and try the mice package to fill in the missing values.
# We use the visdat package
library(visdat)
summary(moviespersonal$runtime)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 85.00 95.00 93.78 107.00 1256.00 251
vis_miss(moviespersonal, warn_large_data = FALSE)
The plot shows that release_date is the only other column with noticeable missing data, but we cannot fill it in the same way because the values are dates.
moviespersonal <- moviespersonal %>%
mutate(missing_inf = is.na(runtime))
moviespersonal %>%
group_by(missing_inf) %>%
summarize(avg_popularity = mean(popularity))
## # A tibble: 2 × 2
## missing_inf avg_popularity
## <lgl> <dbl>
## 1 FALSE 2.93
## 2 TRUE NA
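The average popularity for the movies with missing runtime is NA because popularity itself contains missing values, so it is difficult to come to a conclusion from this comparison.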
#install.packages("mice")
library("mice", quietly = TRUE)
#method_list <- rep("", ncol(moviespersonal))
# Assign the names of the columns from the moviespersonal dataset to the method_list
#names(method_list) <- colnames(moviespersonal)
# Set the method for the 'runtime' column to "pmm" (predictive mean matching)
#method_list["runtime"] <- "pmm"
# Now run the mice function with the specified method_list
#accounts_mice <- mice(moviespersonal, m=1, maxit=50, meth="pmm", seed=500)
mice could not impute these missing values because they make up too small a sample, so we use the mean to fill them in instead.
# Calculate the mean of the runtime column, excluding NA values
runtime_mean <- mean(moviespersonal$runtime, na.rm = TRUE)
# Replace NA values in the runtime column with the mean
moviespersonal <- moviespersonal %>%
mutate(runtime = ifelse(is.na(runtime), runtime_mean, runtime))
moviespersonal %>%
arrange(popularity) %>%
vis_miss(warn_large_data = FALSE)
In this project, we began by inspecting the data, cleaning it, and handling null values and duplicates. Cleaning is the longest and most important part of the process, because it determines the quality of everything we analyze afterwards. We also identified the incorrect data types and the corrupted rows and removed the anomalies. The genre data received specific attention: we converted it from a JSON-like string format to a more readable form, which is crucial for any further analysis.
We also handled zero values in the revenue and budget columns by replacing them with the mean of the non-zero values, and we made sure that no NA values were left untreated. This step is intended to prevent the analysis from being biased by missing or unreported financial data.
Another important step is ensuring data uniqueness by removing duplicate movie titles, which is crucial for accurate movie counts in the analysis.
In summary, this work follows a systematic approach to preparing the dataset for statistical analysis and possible model construction. However, the method we chose for missing and zero values should be examined further to confirm it is the most appropriate one, as it may affect the integrity of the analysis, especially if the zeros have real meaning within the dataset. After this initial data preparation, we are ready for exploratory data analysis to uncover insights and inform any hypotheses or models we plan to build.
We analyze Adult movies by status
table1 <- table(moviespersonal$status, moviespersonal$adult)
prop.table(table1) * 100
##
## False True
## 0.182378020 0.000000000
## Canceled 0.004737091 0.000000000
## In Production 0.042633823 0.002368546
## Planned 0.033159640 0.000000000
## Post Production 0.215537660 0.000000000
## Released 99.005210801 0.018948366
## Rumored 0.495026054 0.000000000
We notice that adult-rated films make up only about 0.02% of the dataset, and nearly all of the rest (about 99%) are released non-adult films.
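The percentages above are shares of the whole table. To read the table within each adult category instead, `prop.table()` accepts a `margin` argument. The following is a minimal sketch on toy vectors; `status` and `adult` here are made-up stand-ins for the real columns.

```r
# Toy stand-ins for the status and adult columns.
status <- c("Released", "Released", "Released", "Rumored", "Released")
adult  <- c("False", "False", "True", "False", "False")

tab <- table(status, adult)

# Percentages over the whole table (all cells sum to 100).
overall_pct <- prop.table(tab) * 100

# Percentages within each adult category (each column sums to 100).
by_adult_pct <- prop.table(tab, margin = 2) * 100

# Overall share of adult titles, in percent.
adult_share <- sum(tab[, "True"]) / sum(tab) * 100

overall_pct
by_adult_pct
adult_share
```

With `margin = 2` the question becomes "among adult titles, what share is released?", which is easier to interpret when one category is very rare.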
Now we look at the difference graphically.
For this particular part of the setup, we will be using the esquisse package, which lets us build ggplot2 graphics interactively and makes it easier to produce detailed, polished charts.
Note that, unlike a typical library, esquisse is used by interacting with the Addins menu at the top of the RStudio window rather than by calling functions in the script.
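The chart itself is not reproduced in this text. A comparable bar chart can be written directly in ggplot2; the sketch below uses a toy data frame, and the name `films`, the title, and the axis labels are assumptions rather than the original plot settings.

```r
library(ggplot2)

# Toy data standing in for the status and adult columns.
films <- data.frame(
  status = c("Released", "Released", "Rumored", "Released", "In Production"),
  adult  = c("False", "True", "False", "False", "True")
)

# One bar per status, split side by side by the adult flag.
status_plot <- ggplot(films, aes(x = status, fill = adult)) +
  geom_bar(position = "dodge") +
  labs(title = "Movie status by adult flag",
       x = "Status", y = "Number of films")

status_plot
```

Because adult titles are so rare in the real data, a log-scaled y axis or a proportion-based chart may be needed for their bars to be visible.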
moviespersonal$year <- substr(moviespersonal$release_date, 1, 4)
In this section, we focused on generating tables and visualizations to glean insights from the dataset. Our analysis began with the creation of contingency tables to explore the relationship between movie status and adult content, as well as between video availability and adult content. These tables provide a comprehensive overview of the distribution of these variables within the dataset.
Furthermore, we leveraged the power of data visualization to gain deeper insights. Using the ggplot2 package, we crafted a bar plot to visually represent the distribution of movie status across different adult content categories. This graphical representation offers a clear depiction of how movie status varies with respect to adult content.
Additionally, we utilized scatter plots to examine relationships between key variables. Specifically, we explored the relationship between average vote and revenue, as well as between budget and revenue. These scatter plots enable us to visualize potential trends or patterns within the data, shedding light on any underlying relationships between these variables.
In summary, our analysis in this section underscores the importance of both tabular and graphical representations in uncovering insights from the dataset. Through careful examination of tables and visualizations, we gain a deeper understanding of the relationships and trends present in the data, laying the foundation for further analysis and interpretation.
# With this particular line of code, we are calculating the profit of each movie
moviespersonal <- moviespersonal %>%
mutate(profit= revenue - budget)
With this code, we compute group means: the mean revenue for each average score (vote_average) on the Internet Movie Database (IMDb). The intention is to show that a high score online does not necessarily mean a movie performs well at the box office.
# This line of code helps us with stopping the numbers from appearing in scientific notation
options(scipen = 999)
# Now we code the mean revenue based on the average score on IMDb
mean_revenue <- moviespersonal %>%
group_by(vote_average) %>%
summarize(mean_revenue = mean(revenue, na.rm = TRUE)) %>%
arrange(desc(mean_revenue))
# Print the resulting table
print(mean_revenue)
## # A tibble: 92 × 2
## vote_average mean_revenue
## <dbl> <dbl>
## 1 7.9 87140789.
## 2 8.1 87125416.
## 3 9.1 84393695.
## 4 7.6 82704986.
## 5 8.2 79617362.
## 6 8.3 78486765.
## 7 7.5 77187057.
## 8 7.4 75553881.
## 9 7.7 73748020.
## 10 6.7 72205531.
## # ℹ 82 more rows
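With this code we find the mean revenue per genre. Because a movie can have more than one genre, we first separate the genres into individual rows and then compute the mean revenue of the movies that include each genre. Note that the genres are split on spaces, so two-word genres such as Science Fiction and TV Movie appear as two separate entries (Science/Fiction and TV/Movie) with identical values.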
# Create a new column with separated genres into new rows, keeping the original 'genre_names' column unchanged
movies_with_separated_genres <- moviespersonal %>%
mutate(separated_genres = genre_names) %>%
separate_rows(separated_genres, sep = " ") %>%
select(genre_names, separated_genres, revenue, profit)
# Get unique genres from the new 'separated_genres' column
unique_genres <- movies_with_separated_genres %>%
select(separated_genres) %>%
distinct() %>%
pull(separated_genres)
# Initialize a dataframe to store the average revenue for each unique genre
mean_revenue_by_genre <- data.frame(genre = character(), mean_revenue = numeric(), stringsAsFactors = FALSE)
# Iterate over each unique genre to calculate the average revenue
for(genre in unique_genres) {
mean_revenue <- movies_with_separated_genres %>%
# Select rows where the new 'separated_genres' column contains the current genre
filter(str_detect(separated_genres, fixed(genre))) %>%
# Calculate the average revenue for the selected genre
summarize(mean_revenue = mean(revenue, na.rm = TRUE)) %>%
# Extract the average revenue value
pull(mean_revenue)
# Append the genre and its average revenue to the dataframe
mean_revenue_by_genre <- rbind(mean_revenue_by_genre, data.frame(genre = genre, mean_revenue = mean_revenue))
}
# Display the result
print(mean_revenue_by_genre)
## genre mean_revenue
## 1 Animation 89334110
## 2 Comedy 67805376
## 3 Family 89513534
## 4 Adventure 103618594
## 5 Fantasy 94402872
## 6 Romance 64738395
## 7 Drama 63925945
## 8 Action 81127887
## 9 Crime 66435133
## 10 Thriller 68885257
## 11 Horror 64464103
## 12 History 64717137
## 13 Science 83469618
## 14 Fiction 83469618
## 15 Mystery 67625498
## 16 War 68262127
## 17 Foreign 65258394
## 18 NaN
## 19 Music 65440574
## 20 Documentary 65198337
## 21 Western 65258973
## 22 TV 68747587
## 23 Movie 68747587
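The per-genre loop above can also be expressed as a grouped summary. The sketch below is an alternative, not the code that produced the table: it assumes the genres are stored with a comma-and-space separator (which the original column does not use), so that two-word genres stay in one piece. The data frame `movies` holds toy values.

```r
library(dplyr)
library(tidyr)

# Toy data; genres are comma-separated so "Science Fiction" stays intact.
movies <- data.frame(
  genre_names = c("Action, Science Fiction", "Comedy", "Science Fiction, Comedy"),
  revenue     = c(100, 40, 60)
)

mean_revenue_by_genre <- movies %>%
  # One row per movie-genre pair.
  separate_rows(genre_names, sep = ", ") %>%
  group_by(genre_names) %>%
  summarize(mean_revenue = mean(revenue, na.rm = TRUE),
            n_movies = n()) %>%
  arrange(desc(mean_revenue))

mean_revenue_by_genre
```

Grouping on the exact genre value also avoids partial string matches and reports how many movies support each mean.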
We use the e1071 package, which provides the skewness function.
library("e1071")
mean(moviespersonal$profit)
## [1] 47103572
var(moviespersonal$profit)
## [1] 2482584026353584
sd(moviespersonal$profit)
## [1] 49825536
median(moviespersonal$profit)
## [1] 47183112
quantile(moviespersonal$profit)
## 0% 25% 50% 75% 100%
## -111007242 47183112 47183112 47183112 2550965087
min(moviespersonal$profit)
## [1] -111007242
max(moviespersonal$profit)
## [1] 2550965087
range(moviespersonal$profit)
## [1] -111007242 2550965087
skewness(moviespersonal$profit)
## [1] 13.28206
The skewness is positive, so the distribution is skewed to the right. With this data, we can compute the interval of one standard deviation around the mean and, later, the correlation coefficients.
profit_mean <-mean(moviespersonal$profit)
profit_sd <-sd(moviespersonal$profit)
profit_dev <- c(profit_mean - profit_sd, profit_mean + profit_sd)
profit_dev
## [1] -2721964 96929108
The mean profit ± one standard deviation is saved in the object “profit_dev”. Note that this is a measure of spread, not a confidence interval for the mean.
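For comparison, a confidence interval for the mean uses the standard error (the standard deviation divided by the square root of n) instead of the standard deviation itself. The sketch below works on simulated toy values, not on the movie data, and uses the t distribution for a 95% interval.

```r
set.seed(1)
profit <- rnorm(200, mean = 50, sd = 10)  # toy values, not the movie data

n      <- length(profit)
se     <- sd(profit) / sqrt(n)      # standard error of the mean
t_crit <- qt(0.975, df = n - 1)     # critical value for a 95% interval

# 95% confidence interval for the mean.
ci <- c(lower = mean(profit) - t_crit * se,
        upper = mean(profit) + t_crit * se)

# Mean plus or minus one standard deviation, for comparison.
spread <- c(mean(profit) - sd(profit), mean(profit) + sd(profit))

ci
spread
```

The confidence interval is much narrower than the one-standard-deviation range because it describes the uncertainty of the mean, not the variability of individual movies. With strongly skewed data such as profit, the interval is only approximate.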
#Coefficient of variation
CV <- sd(moviespersonal$profit)/mean(moviespersonal$profit)
CV
## [1] 1.057787
#Trimmed mean
mean(moviespersonal$profit, trim=.1)
## [1] 46343198
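# Z-scores
# scale() centers and standardizes the values; as.numeric() drops the matrix attributes
moviespersonal <- moviespersonal %>%
mutate(zscore = as.numeric(scale(profit)))
# Keep only rows with a z-score below 3, and assign the result so the rows are actually dropped
moviespersonal <- moviespersonal %>%
filter(zscore < 3)
# We removed all rows with a z-score above 3 (438 rows)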
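The filter above only trims the upper tail. If extreme losses should be treated as outliers too, the absolute z-score can be used. This is a minimal sketch on toy values showing the difference between the two rules; the vector `profit` below is made up.

```r
# Toy profits with one extreme gain and one extreme loss.
profit <- c(rep(c(9, 10, 11), times = 10), 200, -200)

# Standardize: subtract the mean and divide by the standard deviation.
z <- as.numeric(scale(profit))

keep_upper <- z < 3        # rule used in the text: drops only large positive values
keep_both  <- abs(z) < 3   # two-sided rule: drops extreme values in both tails

sum(!keep_upper)
sum(!keep_both)
```

Which rule is appropriate depends on the question: keeping large losses preserves information about failed movies, while the two-sided rule treats both tails symmetrically.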
Now, we will add a new column called words_title which will tell us how many words are in the title. This column will then be grouped and compared with the profit per each different word count
moviespersonal$words_title <- str_count(moviespersonal$original_title, "\\w+")
moviespersonal %>%
group_by(words_title) %>%
summarize(mean_profit = mean(profit),
sd_profit = sd(profit),
median_profit = median(profit),
quantile_profit = quantile(profit, 0.90),
count = n()) %>%
arrange(desc(mean_profit))
## # A tibble: 20 × 6
## words_title mean_profit sd_profit median_profit quantile_profit count
## <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 10 64071620. 117714391. 47183112. 49509867. 127
## 2 8 57643350. 102907506. 47183112. 47183112. 510
## 3 7 54891330. 78345824. 47183112. 52947905 896
## 4 11 54448920. 62437300. 47183112. 60466534. 73
## 5 16 53851205. 11549475. 47183112. 63186534. 3
## 6 12 51830574. 33066573. 47183112. 47183112. 41
## 7 9 50407416. 49948258. 47183112. 47183112. 245
## 8 14 48806518. 5853273. 47183112. 47183112. 13
## 9 5 48545257. 56870932. 47183112. 47183112. 3575
## 10 6 48448381. 51368166. 47183112. 47183112. 1792
## 11 4 47456572. 44430688. 47183112. 47183112. 6254
## 12 15 47183112. 0 47183112. 47183112. 9
## 13 17 47183112. 0 47183112. 47183112. 2
## 14 18 47183112. 0 47183112. 47183112. 2
## 15 19 47183112. 0 47183112. 47183112. 3
## 16 3 46681369. 42229807. 47183112. 48787390. 9280
## 17 1 45982691. 54139116. 47183112. 60787390. 8056
## 18 2 45970169. 44364092. 47183112. 60927390. 11314
## 19 13 44236877. 10057832. 47183112. 47183112. 24
## 20 20 24350000 NA 24350000 24350000 1
Now we will group by year to see which years were the most profitable for cinema
moviespersonal %>%
group_by(year) %>%
summarize(mean_profit = mean(profit),
sd_profit = sd(profit),
median_profit = median(profit),
quantile_profit = quantile(profit, 0.90),
count = n()) %>%
arrange(desc(mean_profit))
## # A tibble: 136 × 6
## year mean_profit sd_profit median_profit quantile_profit count
## <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1902 68781405. NA 68781405. 68781405. 1
## 2 2017 56221083. 81988788. 47183112. 64627390. 469
## 3 1905 51503961. 9661710. 47183112. 60145657. 5
## 4 2016 51400780. 69440583. 47183112. 63787390. 1442
## 5 2015 50584676. 81454255. 47183112. 58487390. 1744
## 6 1977 50302191. 46357949. 47183112. 47183112. 312
## 7 1904 50268366. 8162815. 47183112. 55821823. 7
## 8 2018 49824823. 4193441. 47183112. 54387390. 5
## 9 2012 49425227. 60546406. 47183112. 65787390. 1589
## 10 2011 49410075. 59204531. 47183112. 66987390. 1535
## # ℹ 126 more rows
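The year with the highest mean profit was 1902, then 2017, then 1905. Note that 1902 and 1905 contain only 1 and 5 movies respectively, so their means are not very reliable.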
ggplot(moviespersonal, aes(x = year, y = profit)) +
geom_density() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
ggtitle("Profit by Year")
We will continue using the word count vs profit to find the correlation between them (if any)
## `geom_smooth()` using formula = 'y ~ x'
The linear model applied to the data suggests a relatively flat slope,
indicating that there is minimal change in profit as the number of words
in the title increases.
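The plotting code for this chart is not shown in the text. A scatter plot with a fitted linear trend could look like the sketch below, which uses simulated toy data; the data frame `toy` and its values are assumptions. It also fits the same line with `lm()` so the slope can be read as a number.

```r
library(ggplot2)

set.seed(42)
# Toy data: word counts with a weak, noisy relationship to profit.
toy <- data.frame(words_title = sample(1:10, 200, replace = TRUE))
toy$profit <- 50 + 0.5 * toy$words_title + rnorm(200, sd = 20)

# Scatter plot with a linear trend line.
p <- ggplot(toy, aes(x = words_title, y = profit)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE)
p

# The same line fitted numerically.
fit   <- lm(profit ~ words_title, data = toy)
slope <- coef(fit)[["words_title"]]
slope

# In a one-predictor regression, R squared equals the squared Pearson r.
r <- cor(toy$profit, toy$words_title)
c(r = r, r_squared = summary(fit)$r.squared)
```

The last line makes the distinction between r and r squared explicit: a correlation of 0.03 corresponds to an r squared of roughly 0.001.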
Now, we compute the Pearson coefficient of correlation
moviespersonal %>%
summarize(N = n(), r = cor(profit, words_title))
## N r
## 1 42220 0.03440706
The correlation between the word count in the title and the movie's profit is extremely low / null, 0.03.
We will do this with the budget and the popularity to see if there are any positive correlations.
## `geom_smooth()` using formula = 'y ~ x'
## N r
## 1 42220 0.4841505
Although still not a very strong correlation, budget and profit have a moderate correlation of 0.48.
Now using Profit and Revenue
## `geom_smooth()` using formula = 'y ~ x'
moviespersonal %>%
summarize(N = n(), r = cor(revenue, profit))
## N r
## 1 42220 0.9745676
As expected, we see a very strong correlation, since profit is computed directly from revenue (profit = revenue - budget).
Now using Profit and Popularity
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## N r
## 1 42220 NA
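The coefficient for popularity is NA because the popularity column contains at least one missing value (the warnings above report one row removed), and cor() returns NA unless missing values are excluded. We therefore try another variable.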
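An NA result from `cor()` usually means that at least one of the inputs contains a missing value. The `use` argument tells `cor()` to drop incomplete pairs. The sketch below shows this on toy vectors, which are not the movie data.

```r
# Toy vectors; popularity has one missing value.
profit     <- c(10, 20, 30, 40, 50)
popularity <- c(1.0, 2.5, NA, 3.5, 5.0)

# Default behaviour: any NA in the inputs makes the result NA.
r_default <- cor(profit, popularity)

# Use only the pairs where both values are present.
r_complete <- cor(profit, popularity, use = "complete.obs")

r_default
r_complete
```

Applied to the movie data, the same argument would go inside the `summarize()` call, for example `cor(profit, popularity, use = "complete.obs")`.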
## `geom_smooth()` using formula = 'y ~ x'
## N r
## 1 42067 0.04163259
To exclude movies longer than 250 minutes (just over 4 hours), we used the filter function. The resulting coefficient, r = 0.04, shows that the correlation between runtime and profit is positive but extremely weak.
The revenues per genre are widely spread, which reflects the large amount of data. Foreign, Movie and TV have very low revenue, while Action, Adventure, Fantasy, Science and Fiction show extreme outliers, because the movie Avatar belongs to these genres and has the highest revenue in the dataset.
Now we analyze the new column that we created, named profit.
## # A tibble: 23 × 6
## separated_genres mean_profit sd_profit median_profit quantile_90 count
## <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 "Adventure" 71428203. 134477234. 47183112. 99600000 3231
## 2 "Fantasy" 65023884. 124170024. 47183112. 68739390. 2119
## 3 "Animation" 62345097. 95348486. 47183112. 67987390. 1846
## 4 "Family" 61621163. 97836888. 47183112. 67507390. 2590
## 5 "Fiction" 57536350. 104793345. 47183112. 68193990. 2829
## 6 "Science" 57536350. 104793345. 47183112. 68193990. 2829
## 7 "Action" 54819226. 91432796. 47183112. 66787390. 6119
## 8 "Movie" 47984308. 3848747. 47183112. 47183112. 673
## 9 "TV" 47984308. 3848747. 47183112. 47183112. 673
## 10 "" 47093271. 6241410. 47183112. 47183112. 2280
## # ℹ 13 more rows
These results suggest that genres like “Adventure,” “Fantasy,” and “Animation” have the potential for very high profits, although this may be partly driven by Avatar belonging to some of these genres, which we visualize in the next part. There is significant variability within each genre. Genres like “Drama” and “Comedy” are prolific but tend to have lower average profits. This analysis can inform movie production decisions, particularly if the goal is to maximize profitability by focusing on genres with higher earning potential.
## Skewness of revenue: 12.78445
## Skewness of budget: 7.167064
Density of Original Budget: The original budget distribution is highly right-skewed, indicating that most movies have a low budget, while a few movies have a significantly higher budget. The long tail to the right shows that there are outliers with exceptionally large budgets compared to the rest.
Density of Log-Transformed Budget: The log transformation has normalized the distribution of budget values, as evidenced by the peak around 15 to 17 on the log_budget scale. This suggests that a majority of movies have a budget within a mid-range when the scale is logarithmic, reducing the influence of extreme values.
Density of Original Revenue: The density plot for the original revenue values demonstrates extreme right skewness, where the majority of movies earn revenue in the lower range, while very few have extraordinarily high revenues, creating a long right tail. This skewness could be due to blockbuster hits which are rare but generate a massive amount of revenue.
Density of Log-Transformed Revenue: After the logarithmic transformation, the revenue data shows a more bell-shaped distribution, though there is still a slight right skew. This indicates a more normalized distribution, but with a few outliers with exceptionally high revenue, visible from the tail extending to the right.
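The code behind these density plots is not shown. The effect of a log transformation on skewness can be illustrated with simulated right-skewed values, as in the sketch below. The `skew()` helper is a simple moment-based estimate written for this example (the report itself uses `skewness()` from e1071), and the budget values are simulated.

```r
# Moment-based skewness: third central moment over the cubed standard deviation.
skew <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

set.seed(7)
# Simulated right-skewed budgets (log-normal), not the movie data.
budget <- rlnorm(5000, meanlog = 16, sdlog = 1.2)

# Natural log; safe here because all simulated values are positive.
log_budget <- log(budget)

skew(budget)
skew(log_budget)
```

If a column can contain zeros, `log1p()` (the log of 1 + x) is a common substitute for `log()`, since the log of zero is negative infinity.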
We now show the summarized, log-transformed data to see the central tendency and variability of revenue and budget.
## mean_log_revenue median_log_revenue iqr_log_revenue mean_log_budget
## 1 17.70161 18.04653 0 16.58778
## median_log_budget iqr_log_budget
## 1 16.8884 0
Now we will use rnorm to visualize a simulated histogram for revenue
# The following code examines the distribution of movie revenues through simulation.
# We visualize this distribution as a histogram to make the data easier to understand and to calculate the probability that a movie generates less than a certain amount of revenue, assuming that revenues are normally distributed.
mean_revenue <- mean(moviespersonal$revenue, na.rm = TRUE)
sd_revenue <- sd(moviespersonal$revenue, na.rm = TRUE)
n10000 <- rnorm(10000, mean = mean_revenue, sd = sd_revenue)
# Create a histogram
hist(n10000, breaks = 100, main = "Histogram of Simulated Revenues",
xlab = "Revenue", ylab = "Frequency", col = "lightblue")
value_to_evaluate <- 5
probability <- pnorm(value_to_evaluate, mean = mean_revenue, sd = sd_revenue, lower.tail = TRUE)
# Display the probability result
cat("The probability that the revenue is less than", value_to_evaluate, "is:", probability, "\n")
## The probability that the revenue is less than 5 is: 0.1210488
The code first computes how much the films earn on average and how much the earnings vary from film to film, with some films earning a lot and others generating little revenue. It then simulates 10,000 revenue values from a normal distribution with that mean and standard deviation, and plots them as a histogram to show how earnings would be distributed under this assumption. Finally, we calculate the probability that a film generates less than the chosen threshold. Note that value_to_evaluate is set to 5, so the result above (about 0.12) is the probability of revenue below 5 dollars; to evaluate 5 million dollars, the threshold would need to be 5000000.
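Because the real revenues are strongly right-skewed, it can help to compare the normal-model probability with the share observed directly in the data. The sketch below does this on simulated log-normal values, which are not the movie data; the threshold of 5 million is written out in full as `5e6`.

```r
set.seed(123)
# Simulated right-skewed revenues (log-normal), standing in for the real column.
revenue <- rlnorm(10000, meanlog = 17, sdlog = 1.5)

threshold <- 5e6  # 5 million, in the same units as revenue

# Probability under a normal model with the sample mean and standard deviation.
p_normal <- pnorm(threshold, mean = mean(revenue), sd = sd(revenue))

# Share of simulated movies that actually fall below the threshold.
p_empirical <- mean(revenue < threshold)

p_normal
p_empirical
```

A large gap between the two numbers is a sign that the normal assumption does not describe the data well, and that the empirical share or a model on the log scale may be more informative.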
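Overall, under the normal assumption this distribution suggests that, while extreme financial outcomes are possible in the movie industry, most movies hover around a moderate revenue range. Because the normal curve is symmetric while the real revenue data is strongly right-skewed (skewness 12.78), this conclusion should be read as a rough approximation.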