1. Introduction

In this case study, I investigate a dataset containing information about roughly 5,000 movies (4,803 rows) collected from The Movie Database (TMDb).

I downloaded the TMDB 5000 Movie Dataset, which is divided into two parts, tmdb_5000_credits and tmdb_5000_movies, and saved them as 'tmdb_5000_credits.csv' and 'tmdb_5000_movies.csv'.

In this project, I will provide various visualizations to identify patterns in the data.

This analysis answers the following questions.

Exercise 1: Are highly rated movies associated with revenue?

1. Load the libraries used throughout the analysis.

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

2. Use the read_csv() function from the readr package to load each file into a data frame.

df1 <- read_csv("tmdb_5000_credits.csv")

## Rows: 4803 Columns: 4

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): title, cast, crew
## dbl (1): movie_id

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

df2 <- read_csv("tmdb_5000_movies.csv")

## Rows: 4803 Columns: 20

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (12): genres, homepage, keywords, original_language, original_title, ov...
## dbl   (7): budget, id, popularity, revenue, runtime, vote_average, vote_count
## date  (1): release_date

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
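The messages above point to two readr features: `show_col_types = FALSE` and `spec()`. Here is a minimal sketch of both. It uses a small temporary file instead of the real CSVs, and the file contents and the name `demo_df` are made up for illustration.

```r
library(readr)

# Write a tiny stand-in CSV to a temporary file (illustrative data only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,title,budget", "1,Movie A,1000", "2,Movie B,0"), tmp)

# show_col_types = FALSE suppresses the column specification message
demo_df <- read_csv(tmp, show_col_types = FALSE)

# spec() retrieves the full column specification when it is needed
spec(demo_df)
```

The same argument can be passed when reading tmdb_5000_credits.csv and tmdb_5000_movies.csv to keep the knitted output shorter.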

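3. Now merge the two data frames into a single data frame, x1.df. The only column name the two files share is title, so that is the column merge() joins on.

# Join credits and movies on their shared column
x1.df <- merge(df1, df2, by = "title")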
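Joining on title can multiply rows when two different movies share a title. The sketch below shows the effect on toy data and one possible alternative, joining on the ID columns. It assumes that movie_id in the credits file refers to the same movie as id in the movies file, which the analysis above does not verify. The data frames `credits` and `movies` and their values are made up.

```r
library(dplyr)

# Toy stand-ins for df1 (credits) and df2 (movies); values are made up
credits <- data.frame(movie_id = c(1, 2, 3),
                      title = c("Alpha", "Alpha", "Beta"))
movies <- data.frame(id = c(1, 2, 3),
                     title = c("Alpha", "Alpha", "Beta"),
                     revenue = c(100, 200, 300))

# Joining on title pairs every "Alpha" row with every other "Alpha" row
by_title <- merge(credits, movies, by = "title")
nrow(by_title)

# Joining on the ID columns keeps one row per movie
by_id <- inner_join(credits, movies, by = c("movie_id" = "id"))
nrow(by_id)
```

Comparing nrow() before and after the merge is a quick way to check whether the real data is affected.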

4. We only need a few of the columns from the data frame, so use the select() function to keep title, vote_average, vote_count, budget, and revenue. Piping (%>%) is used for the rest of the code.

x1.df %>%
  select(title, vote_average, vote_count, budget, revenue)

5. Group the records by revenue.

x1.df %>%
  select(title, vote_average, vote_count, budget, revenue) %>%
  group_by(revenue)

6. Use the summarize() function to return the vote_average values within each revenue group.

x1.df %>%
  select(title, vote_average, vote_count, budget, revenue) %>%
  group_by(revenue) %>%
  summarize(vote_average)

## `summarise()` has grouped output by 'revenue'. You can override using the `.groups` argument.
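The message above mentions the `.groups` argument. As a minimal sketch on toy data (the data frame `toy` and the column name `mean_vote` are made up), this is how summarize() can collapse each group to a single value and return an ungrouped result:

```r
library(dplyr)

# Toy data: two movies share a revenue value (made-up numbers)
toy <- data.frame(revenue = c(100, 100, 250),
                  vote_average = c(6, 8, 7))

# One row per revenue group, holding the mean vote_average.
# .groups = "drop" returns an ungrouped result, so no grouping message is shown.
per_revenue <- toy %>%
  group_by(revenue) %>%
  summarize(mean_vote = mean(vote_average), .groups = "drop")

per_revenue
```

This differs from summarize(vote_average), which keeps every individual value rather than aggregating them.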

7. Finally, create a scatterplot with a regression line that shows vote average against the log of revenue.

x1.df %>%
  select(title, vote_average, vote_count, budget, revenue) %>%
  group_by(revenue) %>%
  summarize(vote_average) %>%
  ggplot(mapping = aes(x = vote_average, y = log(revenue))) +
  geom_point() +
  geom_smooth(method = lm, se = TRUE) +
  ggtitle("vote_average vs revenue") +
  xlab("Vote average") +   # x axis shows vote_average
  ylab("Log of revenue")   # y axis shows log(revenue)

## `summarise()` has grouped output by 'revenue'. You can override using the `.groups` argument.

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 1430 rows containing non-finite values (stat_smooth).

The scatterplot suggests that the answer is yes: movies with a higher vote average tend to have higher revenue.
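The warning about 1430 non-finite values is consistent with movies whose revenue is recorded as 0, because log(0) is -Inf. The following sketch, on made-up toy data, shows one way to drop those rows before taking the log. The names `toy`, `positive`, and `log_revenue` are illustrative.

```r
library(dplyr)

# Toy data with one zero revenue (made-up numbers)
toy <- data.frame(vote_average = c(5.5, 6.8, 7.9),
                  revenue = c(0, 1e6, 5e7))

# log(0) is -Inf, which ggplot2 reports as a non-finite value
log(toy$revenue)

# Keep only rows with positive revenue before taking the log
positive <- toy %>%
  filter(revenue > 0) %>%
  mutate(log_revenue = log(revenue))

all(is.finite(positive$log_revenue))
```

Filtering explicitly makes it clear which movies are excluded from the fit, rather than leaving that to the plotting warning.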

Exercise 2: Are highly rated movies associated with budget?

In this exercise we'll determine whether highly rated movies are associated with budget.

The tmdb_5000_movies.csv file contains a vote_average column that holds the average rating for each movie.

1. Load the readr, dplyr, and ggplot2 packages: library(readr), library(dplyr), library(ggplot2).
2. Group the data by budget and return the vote_average values for each budget.
3. Create a scatterplot of the results.

x1.df %>%
  select(budget, vote_average, vote_count, revenue) %>%
  group_by(budget) %>%
  summarize(vote_average) %>%
  ggplot(mapping = aes(x = vote_average, y = log(budget))) +
  geom_point() +
  geom_smooth(method = lm, se = TRUE) +
  ggtitle("Vote_average vs Budget") +
  xlab("Vote average") +   # x axis shows vote_average
  ylab("Log of budget")    # y axis shows log(budget)

## `summarise()` has grouped output by 'budget'. You can override using the `.groups` argument.

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 1039 rows containing non-finite values (stat_smooth).

Finally, let's create a violin plot to see how the log of budget is distributed across vote averages.

x1.df %>%
  select(budget, vote_average, vote_count, revenue) %>%
  group_by(budget) %>%
  summarize(vote_average) %>%
  ggplot(mapping = aes(x = vote_average, y = log(budget))) +
  geom_violin() +
  geom_boxplot(width = 0.1) +
  ggtitle("Vote_average vs Budget") +
  xlab("Vote average") +   # x axis shows vote_average
  ylab("Log of budget")    # y axis shows log(budget)

## `summarise()` has grouped output by 'budget'. You can override using the `.groups` argument.

## Warning: Removed 1039 rows containing non-finite values (stat_ydensity).

## Warning: Continuous x aesthetic -- did you forget aes(group=...)?

## Warning: Removed 1039 rows containing non-finite values (stat_boxplot).
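The "Continuous x aesthetic" warning appears because a violin plot expects a discrete variable on the x axis. One possible way to address it is to bin vote_average into bands with cut() first. The sketch below uses toy data, and the bin edges and the names `toy`, `binned`, and `rating_band` are arbitrary choices for illustration.

```r
library(dplyr)
library(ggplot2)

# Toy data (made-up numbers)
toy <- data.frame(vote_average = c(3.2, 4.5, 5.9, 6.4, 7.3, 8.1),
                  budget = c(1e5, 2e6, 5e6, 1e7, 4e7, 9e7))

# Bin the continuous rating into three labelled bands:
# (0, 5], (5, 7] and (7, 10]
binned <- toy %>%
  mutate(rating_band = cut(vote_average,
                           breaks = c(0, 5, 7, 10),
                           labels = c("0-5", "5-7", "7-10")))

# One violin per rating band, with the log of budget on the y axis
p <- ggplot(binned, aes(x = rating_band, y = log(budget))) +
  geom_violin() +
  xlab("Vote average band") +
  ylab("Log of budget")
```

With a discrete x, each violin describes the budget distribution of one rating band, which is easier to read than a single violin spanning all ratings.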

Exercise 3: Which movie lengths are most popular with audiences?

Use the geom_col() function along with ggplot() to create a bar chart of popularity by movie length (runtime). Because geom_col() stacks bars that share the same x value, each bar shows the combined popularity of all movies with that runtime.

x1.df %>%
  select(popularity, runtime, vote_count) %>%
  group_by(runtime, popularity) %>%
  summarize(runtime, mean(popularity)) %>%
  ggplot() +
  geom_col(mapping = aes(x = runtime, y = popularity), fill = "pink")

## `summarise()` has grouped output by 'runtime', 'popularity'. You can override using the `.groups` argument.

## Warning: Removed 2 rows containing missing values (position_stack).
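If the goal is the average popularity at each runtime rather than the stacked total, one option is to group by runtime only and plot the summarized column. This is a sketch on made-up toy data, not the method used above. The names `toy`, `by_runtime`, and `mean_popularity` are illustrative.

```r
library(dplyr)
library(ggplot2)

# Toy data: several movies share a runtime (made-up numbers)
toy <- data.frame(runtime = c(90, 90, 120, 120, 150),
                  popularity = c(10, 30, 20, 40, 50))

# One row per runtime, holding the mean popularity of its movies
by_runtime <- toy %>%
  group_by(runtime) %>%
  summarize(mean_popularity = mean(popularity), .groups = "drop")

# Bar height is now the mean, not the sum, for each runtime
p <- ggplot(by_runtime) +
  geom_col(mapping = aes(x = runtime, y = mean_popularity), fill = "pink")
```

Using the mean keeps very common runtimes from dominating the chart simply because more movies share them.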

Exercise 4: Which year has the highest budget?

1. Load the packages: library(readr), library(dplyr), library(ggplot2).
2. Filter the data so that only rows whose status is Released or Post Production are kept.
3. Group the data by release_date and status.
4. Summarize the budget.

x1.df %>%
  select(budget, release_date, status) %>%
  filter(status %in% c('Released', 'Post Production')) %>%
  group_by(release_date, status) %>%
  summarize(budget) %>%
  ggplot(mapping = aes(x = release_date, y = log(budget), colour = status)) +
  geom_point() +
  geom_smooth(method = lm, se = TRUE) +
  ggtitle("Which Year Has The Highest budget") +
  xlab("Release date") +   # x axis shows the full release date
  ylab("Log of budget")    # y axis shows log(budget) per movie

## `summarise()` has grouped output by 'release_date', 'status'. You can override using the `.groups` argument.

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 1035 rows containing non-finite values (stat_smooth).

## Warning in qt((1 - level)/2, df): NaNs produced

## Warning: Removed 1 rows containing missing values (geom_point).

## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
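The plot above shows individual release dates, so it does not directly name a single year. A possible way to answer the question numerically is to extract the year from release_date and total the budget per year. The sketch below uses made-up toy data, and the names `toy`, `by_year`, `year`, and `total_budget` are illustrative.

```r
library(dplyr)

# Toy data (made-up dates and budgets)
toy <- data.frame(
  release_date = as.Date(c("2009-12-10", "2009-05-01", "2012-07-16")),
  budget = c(2e8, 1e8, 2.5e8)
)

# Extract the calendar year, total the budget per year,
# and sort so the year with the largest total comes first
by_year <- toy %>%
  mutate(year = as.integer(format(release_date, "%Y"))) %>%
  group_by(year) %>%
  summarize(total_budget = sum(budget), .groups = "drop") %>%
  arrange(desc(total_budget))

# The first row holds the year with the highest total budget
by_year$year[1]
```

Applied to x1.df, the same pipeline would return one row per release year, which could also be plotted with geom_col().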

Exercise 5: How does popularity depend on revenue?

1. Load the packages: library(readr), library(dplyr), library(ggplot2).
2. Only a few columns are needed for this exercise, so use the select() function to keep the popularity and revenue columns.
3. Group the data by popularity.
4. Summarize the revenue.

x1.df %>%
  select(popularity, revenue) %>%
  group_by(popularity) %>%
  summarize(revenue) %>%
  ggplot(mapping = aes(x = popularity, y = log(revenue))) +
  geom_point() +
  geom_smooth(method = lm, se = TRUE) +
  ggtitle("Popularity Vs revenue") +
  xlab("Popularity") +     # x axis shows popularity
  ylab("Log of revenue")   # y axis shows log(revenue)

## `summarise()` has grouped output by 'popularity'. You can override using the `.groups` argument.

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 1430 rows containing non-finite values (stat_smooth).

As the plot shows, popularity and revenue are positively correlated: movies with high popularity tend to earn high revenue.
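The visual impression can be backed by a number. As a minimal sketch on made-up toy data, the Pearson correlation between popularity and the log of revenue can be computed with cor(), after dropping zero revenues so that the log is finite. The names `toy`, `with_revenue`, and `r` are illustrative.

```r
library(dplyr)

# Toy data (made-up numbers); one movie has no recorded revenue
toy <- data.frame(popularity = c(5, 20, 40, 80),
                  revenue = c(0, 1e6, 1e7, 1e8))

# Drop zero revenues so that log(revenue) is finite
with_revenue <- toy %>% filter(revenue > 0)

# Pearson correlation between popularity and log revenue;
# a value above 0 indicates a positive association
r <- cor(with_revenue$popularity, log(with_revenue$revenue))
r
```

A positive value on the real data would support the conclusion drawn from the scatterplot, though it would not by itself show that popularity causes higher revenue.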

Conclusions

The purpose of this analysis was to explore a dataset of roughly 5,000 movies collected from The Movie Database (TMDb). After reading the two datasets, I merged them into one data frame. Then I plotted charts to check whether highly rated movies are associated with revenue and with budget. Next, I plotted a chart showing which movie lengths are most popular with audiences.

Finally, I plotted charts showing how budget varies with release date and how popularity relates to revenue.

References

https://github.com/LubnaAlhenaki/Data_Analyst_Nanodegree/blob/main/Project_2/Project2-LubnaAlhenaki.ipynb
https://github.com/reemamohsin4/TMDb-Movie-Data/blob/master/Investigate_a_Dataset.ipynb
https://github.com/divyachandramouli/Investigate_TMDb_Movie_Dataset/blob/master/DC_Investigate_TMDb_movie_dataset.ipynb

TMDB 5000 Movie Dataset Analysis

Rawan Abdullah

06/12/2021