In this case study, I investigate a dataset containing information about roughly 5,000 movies (4,803 rows) collected from The Movie Database (TMDb).
I downloaded the TMDB 5000 Movie Dataset, which is divided into two parts, tmdb_5000_credits and tmdb_5000_movies, and saved them as CSV files (‘Credits.csv’ and ‘Movies.csv’; the code below reads them under their original names, tmdb_5000_credits.csv and tmdb_5000_movies.csv).
In this project, I will provide various visualizations to identify patterns in the data.
The original dataset is available here.
In this analysis, the following questions will be answered:
## Exercise 1
Are highly rated movies associated with revenue?
1. Let’s import all the libraries used below.
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
df1 <- read_csv("tmdb_5000_credits.csv")
## Rows: 4803 Columns: 4
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): title, cast, crew
## dbl (1): movie_id
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
df2 <- read_csv("tmdb_5000_movies.csv")
## Rows: 4803 Columns: 20
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (12): genres, homepage, keywords, original_language, original_title, ov...
## dbl (7): budget, id, popularity, revenue, runtime, vote_average, vote_count
## date (1): release_date
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# With no `by` argument, merge() joins on the column names the two data frames share
x1.df <- merge(df1, df2)
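Because `merge()` is called here without a `by` argument, it joins on whichever column names the two data frames have in common. The sketch below shows one possible alternative, an explicit join on the ID columns. It assumes that `movie_id` in the credits file corresponds to `id` in the movies file, which is an assumption and not something the analysis above confirms. The toy data frames `credits_toy` and `movies_toy` are hypothetical stand-ins, not the real data.

```r
library(dplyr)

# Hypothetical stand-ins for df1 (credits) and df2 (movies)
credits_toy <- tibble(
  movie_id = c(1, 2, 3),
  title    = c("A", "B", "B")   # two different movies that share a title
)
movies_toy <- tibble(
  id      = c(1, 2, 3),
  title   = c("A", "B", "B"),
  revenue = c(100, 0, 250)
)

# Joining on the shared column (title) duplicates rows when titles repeat
by_title <- merge(credits_toy, movies_toy)

# Joining on the ID columns keeps one row per movie
# (assumes movie_id and id identify the same movie)
by_id <- inner_join(credits_toy, movies_toy, by = c("movie_id" = "id"))

nrow(by_title)
nrow(by_id)
```

Comparing the two row counts is a quick way to check whether repeated titles are inflating the merged data frame.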
x1.df %>% select(title, vote_average, vote_count, budget, revenue)
x1.df %>% select(title, vote_average, vote_count, budget, revenue) %>% group_by(revenue)
6. Use the summarize() function to list the vote_average values within each revenue group
x1.df%>%select(title,vote_average,vote_count,budget,revenue) %>% group_by(revenue) %>% summarize(vote_average)
## `summarise()` has grouped output by 'revenue'. You can override using the `.groups` argument.
# x axis: average rating, y axis: log of revenue
x1.df %>%
  select(title, vote_average, vote_count, budget, revenue) %>%
  group_by(revenue) %>%
  summarize(vote_average) %>%
  ggplot(mapping = aes(x = vote_average, y = log(revenue))) +
  geom_point() +
  geom_smooth(method = lm, se = TRUE) +
  ggtitle("vote_average vs revenue") +
  xlab("vote_average") +
  ylab("log(revenue)")
## `summarise()` has grouped output by 'revenue'. You can override using the `.groups` argument.
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1430 rows containing non-finite values (stat_smooth).
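The warning about non-finite values most likely comes from taking `log(revenue)` for movies whose revenue is recorded as 0, since `log(0)` is `-Inf`. The sketch below drops those rows explicitly and then quantifies the association with a Pearson correlation. The data frame `movies_toy` is a hypothetical stand-in for `x1.df`, and using the correlation as a summary is an addition here, not part of the original analysis.

```r
library(dplyr)

# Hypothetical toy data standing in for x1.df
movies_toy <- tibble(
  vote_average = c(5.0, 6.0, 7.0, 8.0, 6.5),
  revenue      = c(1e6, 1e7, 1e8, 1e9, 0)   # last row has no reported revenue
)

# log(0) is -Inf, which ggplot2 reports as a non-finite value
is.finite(log(0))

# Keep only rows with positive revenue before taking the log
rated <- movies_toy %>%
  filter(revenue > 0) %>%
  mutate(log_revenue = log(revenue))

# Pearson correlation between rating and log revenue
cor(rated$vote_average, rated$log_revenue)
```

Filtering first makes the number of excluded movies visible in the data, instead of leaving it to a plotting warning.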
In this exercise we’ll determine whether highly rated movies are associated with budget.
The tmdb_5000_movies.csv file contains a vote_average column that gives the average rating for each movie.
# x axis: average rating, y axis: log of budget
x1.df %>%
  select(budget, vote_average, vote_count, revenue) %>%
  group_by(budget) %>%
  summarize(vote_average) %>%
  ggplot(mapping = aes(x = vote_average, y = log(budget))) +
  geom_point() +
  geom_smooth(method = lm, se = TRUE) +
  ggtitle("Vote_average vs Budget") +
  xlab("vote_average") +
  ylab("log(budget)")
## `summarise()` has grouped output by 'budget'. You can override using the `.groups` argument.
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1039 rows containing non-finite values (stat_smooth).
# Same variables as above, drawn as a violin plot with an overlaid boxplot
x1.df %>%
  select(budget, vote_average, vote_count, revenue) %>%
  group_by(budget) %>%
  summarize(vote_average) %>%
  ggplot(mapping = aes(x = vote_average, y = log(budget))) +
  geom_violin() +
  geom_boxplot(width = 0.1) +
  ggtitle("Vote_average vs Budget") +
  xlab("vote_average") +
  ylab("log(budget)")
## `summarise()` has grouped output by 'budget'. You can override using the `.groups` argument.
## Warning: Removed 1039 rows containing non-finite values (stat_ydensity).
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
## Warning: Removed 1039 rows containing non-finite values (stat_boxplot).
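The "Continuous x aesthetic" warning appears because violin plots and boxplots expect a discrete variable on the x axis. One way to address it, sketched below, is to bin `vote_average` into rating bands with `cut()` before plotting. The break points (6 and 7), the band labels, and the `movies_toy` data are all hypothetical choices for illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for x1.df
movies_toy <- tibble(
  vote_average = c(4.2, 5.1, 5.8, 6.3, 6.9, 7.4, 7.9, 8.3),
  budget       = c(2e6, 5e6, 1e7, 3e7, 2e7, 8e7, 1e8, 1.5e8)
)

# Turn the continuous rating into a factor with three bands
binned <- movies_toy %>%
  filter(budget > 0) %>%
  mutate(rating_bin = cut(vote_average,
                          breaks = c(0, 6, 7, 10),
                          labels = c("up to 6", "6 to 7", "above 7")))

# One violin and one boxplot per rating band
p <- ggplot(binned, aes(x = rating_bin, y = log(budget))) +
  geom_violin() +
  geom_boxplot(width = 0.1) +
  xlab("vote_average (binned)") +
  ylab("log(budget)")
```

Printing `p` draws the plot. With the real data, the band boundaries would need to be chosen to suit the actual distribution of ratings.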
# Popularity by runtime, drawn as a column chart
x1.df %>%
  select(popularity, runtime, vote_count) %>%
  group_by(runtime, popularity) %>%
  summarize(runtime, mean(popularity)) %>%
  ggplot() +
  geom_col(mapping = aes(x = runtime, y = popularity), fill = "pink")
## `summarise()` has grouped output by 'runtime', 'popularity'. You can override using the `.groups` argument.
## Warning: Removed 2 rows containing missing values (position_stack).
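To read "which movie lengths are most liked" off a table as well as a chart, the runtimes can be grouped into bands and popularity averaged within each band. This is a minimal sketch under stated assumptions: the band boundaries (100 and 120 minutes) and the `movies_toy` data are hypothetical. Rows with a missing runtime are dropped first, which may also be what the missing-values warning above refers to.

```r
library(dplyr)

# Hypothetical toy data standing in for x1.df
movies_toy <- tibble(
  runtime    = c(85, 92, 105, 118, 130, 150, NA),
  popularity = c(10, 20, 30, 50, 40, 60, 5)
)

by_length <- movies_toy %>%
  filter(!is.na(runtime)) %>%                      # drop movies with no runtime
  mutate(length_bin = cut(runtime,
                          breaks = c(0, 100, 120, Inf),
                          labels = c("100 min or less",
                                     "101 to 120 min",
                                     "over 120 min"))) %>%
  group_by(length_bin) %>%
  summarize(mean_popularity = mean(popularity),    # one value per band
            n_movies = n())

by_length
```

Keeping `n_movies` next to the mean shows how many movies each average rests on.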
1. Load the libraries:
library(readr)
library(dplyr)
library(ggplot2)
For this script we’ll filter the status column so that only Released and Post Production movies are kept.
Then group the data by release date and status.
Summarize the budget.
# Budget over time for Released and Post Production movies
x1.df %>%
  select(budget, release_date, status) %>%
  filter(status %in% c('Released', 'Post Production')) %>%
  group_by(release_date, status) %>%
  summarize(budget) %>%
  ggplot(mapping = aes(x = release_date, y = log(budget), colour = status)) +
  geom_point() +
  geom_smooth(method = lm, se = TRUE) +
  ggtitle("Which Year Has The Highest budget") +
  xlab("Release date") +
  ylab("log(budget)")
## `summarise()` has grouped output by 'release_date', 'status'. You can override using the `.groups` argument.
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1035 rows containing non-finite values (stat_smooth).
## Warning in qt((1 - level)/2, df): NaNs produced
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
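The steps above (filter by status, group by year and status, summarize the budget) can also be written so that they produce one total per year. The sketch below extracts the year from `release_date` with base R's `format()` and sums the budget within each group. The `movies_toy` data and the column names `year` and `total_budget` are hypothetical. Setting `.groups = "drop"` is one way to silence the grouped-output message.

```r
library(dplyr)

# Hypothetical toy data standing in for x1.df
movies_toy <- tibble(
  release_date = as.Date(c("2009-12-10", "2009-05-01", "2012-07-16",
                           "2012-03-07", "2015-06-09")),
  status       = c("Released", "Released", "Released",
                   "Post Production", "Rumored"),
  budget       = c(237e6, 150e6, 250e6, 260e6, 10e6)
)

budget_by_year <- movies_toy %>%
  filter(status %in% c("Released", "Post Production")) %>%
  mutate(year = as.integer(format(release_date, "%Y"))) %>%  # year as a number
  group_by(year, status) %>%
  summarize(total_budget = sum(budget), .groups = "drop") %>%
  arrange(desc(total_budget))                                # largest total first

budget_by_year
```

The first row of `budget_by_year` then answers the question in the plot title directly for the toy data.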
1. Load the libraries:
library(readr)
library(dplyr)
library(ggplot2)
I only need a few of the columns from the data frame for this exercise, so I use the select() function to retrieve the popularity and revenue columns.
Then group the data by popularity.
Summarize the revenue.
# x axis: popularity, y axis: log of revenue
x1.df %>%
  select(popularity, revenue) %>%
  group_by(popularity) %>%
  summarize(revenue) %>%
  ggplot(mapping = aes(x = popularity, y = log(revenue))) +
  geom_point() +
  geom_smooth(method = lm, se = TRUE) +
  ggtitle("Popularity Vs revenue") +
  xlab("popularity") +
  ylab("log(revenue)")
## `summarise()` has grouped output by 'popularity'. You can override using the `.groups` argument.
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1430 rows containing non-finite values (stat_smooth).
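The straight line drawn by `geom_smooth(method = lm)` comes from an ordinary linear regression of the y variable on the x variable. A minimal sketch of fitting that model directly with base R's `lm()` follows, so the slope can be inspected as a number. The `movies_toy` data is hypothetical, and rows with zero revenue are removed first as an assumption about how to handle `log(0)`.

```r
# Hypothetical toy data standing in for x1.df
movies_toy <- data.frame(
  popularity = c(5, 20, 40, 80, 120, 10),
  revenue    = c(2e6, 3e7, 9e7, 4e8, 9e8, 0)   # last row has no reported revenue
)

# Keep rows where log(revenue) is finite
positive <- subset(movies_toy, revenue > 0)

# Linear model of log revenue on popularity
fit <- lm(log(revenue) ~ popularity, data = positive)

# Estimated change in log(revenue) per one-unit increase in popularity
slope <- coef(fit)[["popularity"]]
slope
```

A positive slope means revenue tends to rise with popularity, and `summary(fit)` gives the standard errors behind the shaded band in the plot.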
The purpose of this analysis was to explore a dataset of roughly 5,000 movies (4,803 rows) collected from The Movie Database (TMDb). After reading the two datasets, I merged the two data frames into one. Then I plotted charts to check whether highly rated movies are associated with revenue and with budget. Next, I plotted a chart showing which movie lengths audiences like most, according to popularity.
Finally, I plotted charts showing which year had the highest budget and how popularity relates to revenue.
(https://github.com/LubnaAlhenaki/Data_Analyst_Nanodegree/blob/main/Project_2/Project2-LubnaAlhenaki.ipynb)
(https://github.com/reemamohsin4/TMDb-Movie-Data/blob/master/Investigate_a_Dataset.ipynb)
(https://github.com/divyachandramouli/Investigate_TMDb_Movie_Dataset/blob/master/DC_Investigate_TMDb_movie_dataset.ipynb)