The objective of this analysis was to examine movies released from 2007 to 2011. The deep dive looked at the relationship between audience ratings and genre, at whether critic ratings and audience ratings have anything in common, and at other insights that would help someone planning a movie understand which genre is likely to be the most profitable and give the movie the best chance of success.
For this I used two structured datasets, Movie_Ratings and Movies_Collections, which were downloaded from a third party, Superdatascience. I am thankful to my instructor, Mr. Kiril.
The datasets I downloaded were reliable, as the data was unbiased. They were original, as they were taken from an original source. They were comprehensive, as they contained the critical information needed to conduct the analysis. They were also cited, which makes my insights credible. They may not be current, but that makes this an analysis of historical data, which can help me in future analyses.
The Movie_Ratings dataset was first checked for blank cells using Go To Special ([ctrl+g + alt+s]), and any blanks found were cleared. I then converted the range into a table so that I could run Power Query on it and sort the data. In the Movies_Collections dataset, the Budget in million $ column was stored as text because each cell contained "mln", so I split the column on the delimiter ’ ’ (a single space), saved the data, and loaded it into a new worksheet. Later I converted the table back to a range so that it could be loaded into R.
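The same kind of split could also be done directly in R instead of Power Query. The following is only a minimal sketch, assuming the budget cells look like "8 mln"; the vector budget_raw is hypothetical and is not taken from the original workbook.

```r
# Hypothetical raw values as they might look before cleaning
budget_raw <- c("8 mln", "105 mln", "20 mln")

# Split each value on the space delimiter and keep the first piece
budget_parts <- strsplit(budget_raw, " ", fixed = TRUE)
budget_mln <- as.numeric(sapply(budget_parts, `[`, 1))

# The column is now numeric and can be summarised or plotted
budget_mln
```

Doing the split in R keeps the cleaning step in the same script as the analysis, which makes it easier to repeat if the source file changes.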
The purpose of cleaning the data and aligning it with the business objectives was to support accurate conclusions.
Notes: Setting up my R environment by loading the ‘ggplot2’ package and the Movie-Ratings dataset.
movies<-read.csv("Movie-Ratings.csv")
colnames(movies)<- c("Film", "Genre","Critic_rating","Audience_rating","Budget_mln","Year_of_release")
str(movies)
## 'data.frame': 562 obs. of 6 variables:
## $ Film : Factor w/ 562 levels "(500) Days of Summer ",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Genre : Factor w/ 7 levels "Action","Adventure",..: 3 2 1 2 3 1 3 5 3 3 ...
## $ Critic_rating : int 87 9 30 93 55 39 40 50 43 93 ...
## $ Audience_rating: int 81 44 52 84 70 63 71 57 48 93 ...
## $ Budget_mln : int 8 105 20 18 20 200 30 32 28 8 ...
## $ Year_of_release: int 2009 2008 2009 2010 2009 2009 2008 2007 2011 2011 ...
movies$Year_of_release<-factor(movies$Year_of_release)
summary(movies)
## Film Genre Critic_rating Audience_rating
## (500) Days of Summer : 1 Action :154 Min. : 0.0 Min. : 0.00
## 10,000 B.C. : 1 Adventure: 29 1st Qu.:25.0 1st Qu.:47.00
## 12 Rounds : 1 Comedy :172 Median :46.0 Median :58.00
## 127 Hours : 1 Drama :101 Mean :47.4 Mean :58.83
## 17 Again : 1 Horror : 49 3rd Qu.:70.0 3rd Qu.:72.00
## 2012 : 1 Romance : 21 Max. :97.0 Max. :96.00
## (Other) :556 Thriller : 36
## Budget_mln Year_of_release
## Min. : 0.0 2007: 79
## 1st Qu.: 20.0 2008:125
## Median : 35.0 2009:116
## Mean : 50.1 2010:119
## 3rd Qu.: 65.0 2011:123
## Max. :300.0
##
The csv file was imported using the function read.csv(), which reads a file in table format and creates a data frame from it (here, movies), with cases corresponding to lines and variables to fields in the file. The column names were changed with colnames(). Using str(), I saw that the year column was an int rather than a factor. For my analysis I needed it to be a factor with only 5 levels, so I converted it with factor(). Then I used summary() to get some statistical information.
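The effect of the factor() conversion can be seen on a very small example. This is a sketch on a hypothetical three-row extract (the films, column headers, and values in csv_text are made up for illustration); the real analysis reads Movie-Ratings.csv.

```r
# Hypothetical three-row extract standing in for the real csv file
csv_text <- "Film,Genre,Rating,Year
Film A,Comedy,87,2009
Film B,Action,30,2008
Film C,Comedy,55,2009"

toy <- read.csv(text = csv_text)
colnames(toy) <- c("Film", "Genre", "Critic_rating", "Year_of_release")

class(toy$Year_of_release)     # the year is read in as a number
toy$Year_of_release <- factor(toy$Year_of_release)
levels(toy$Year_of_release)    # one level per distinct year
summary(toy$Year_of_release)   # counts per year instead of quartiles
```

Once the year is a factor, summary() reports how many films fall in each year, which is what the Year_of_release column of the summary output above shows.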
library(ggplot2)
q<-ggplot(data=movies,aes(x=Critic_rating,y=Audience_rating,
colour=Genre, size=Budget_mln))
The ggplot2 package, which is used for high-end visualizations in R, is loaded first. Then I created an object q, which stores the data and the aesthetic mappings for the plot. aes() is the aesthetic mapping function: it maps variables to visual properties of geoms, such as the position, colour, shape, and size of what we want to plot.
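How a layer inherits or overrides the mappings stored in such an object can be sketched as follows. The data frame toy holds only the first four rows printed by str() above, and the names toy, q_toy, p_default, and p_budget are introduced here purely for illustration.

```r
library(ggplot2)

# First four rows from the str() output, used as a stand-in for movies
toy <- data.frame(
  Critic_rating   = c(87, 9, 30, 93),
  Audience_rating = c(81, 44, 52, 84),
  Budget_mln      = c(8, 105, 20, 18),
  Genre           = c("Comedy", "Adventure", "Action", "Adventure")
)

# Base object: data plus default aesthetic mappings, no geometry yet
q_toy <- ggplot(data = toy,
                aes(x = Critic_rating, y = Audience_rating,
                    colour = Genre, size = Budget_mln))

# This layer inherits every default mapping
p_default <- q_toy + geom_point()

# This layer overrides only x; y, colour and size are still inherited
p_budget <- q_toy + geom_point(aes(x = Budget_mln))
```

Keeping the defaults in one object means that several plots can reuse the same mappings and change only the aesthetic that differs.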
q+geom_point(aes(x=Budget_mln)) +
xlab("Budget in million $")+
theme(axis.title.x = element_text(color="Blue",size=15),
axis.title.y = element_text(color="Blue",size=15))
The plot created is a scatter plot. It suggests that there is no clear correlation between the budget spent on a movie and its audience rating.
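A visual impression like this can be backed up with a correlation coefficient. The sketch below uses only the first ten budget and audience values printed by str() above, so its result says nothing about the full 562 films; on the real data the same calls would be made on movies$Budget_mln and movies$Audience_rating.

```r
# First ten rows from the str() output, not the full dataset
toy <- data.frame(
  Budget_mln      = c(8, 105, 20, 18, 20, 200, 30, 32, 28, 8),
  Audience_rating = c(81, 44, 52, 84, 70, 63, 71, 57, 48, 93)
)

# Pearson correlation (the default) measures linear association
r_pearson <- cor(toy$Budget_mln, toy$Audience_rating)

# Spearman correlation is rank based and less sensitive to a few
# very large budgets
r_spearman <- cor(toy$Budget_mln, toy$Audience_rating, method = "spearman")

c(pearson = r_pearson, spearman = r_spearman)
```

Values close to 0 would support the reading of the scatter plot, while values close to -1 or 1 would contradict it.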
aud_rat_hist<-ggplot(data=movies,aes(x=Audience_rating))
aud_rat_hist+geom_histogram(binwidth = 10,fill="white", color="Blue")+
ylab("Number of movies")+
theme(axis.title.x = element_text(color="black",size=10),
axis.title.y = element_text(color="black",size=10))
This histogram shows a roughly bell-shaped distribution of audience ratings, with most ratings falling between 50 and 70 out of 100. This suggests that audiences usually rate movies above the midpoint of the scale.
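The reading of the histogram can be checked with a few summary numbers. This is a minimal sketch on the first ten audience ratings printed by str() above (the vector aud is an illustrative name); the full column would be movies$Audience_rating.

```r
# First ten audience ratings from the str() output
aud <- c(81, 44, 52, 84, 70, 63, 71, 57, 48, 93)

# Centre of the distribution: similar mean and median hint at symmetry
aud_mean   <- mean(aud)
aud_median <- median(aud)

# Share of ratings that fall in the 50 to 70 band
share_50_70 <- mean(aud >= 50 & aud <= 70)

c(mean = aud_mean, median = aud_median, share_50_70 = share_50_70)
```

For the full dataset, the summary() output above already gives a mean of 58.83 and a median of 58, which are close to each other, as a roughly symmetric distribution would be.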
critic_rating_hist<-ggplot(data=movies)
critic_rating_hist+geom_histogram(binwidth = 10,aes(x=Critic_rating),fill="white", color="Blue")+
ylab("Number of movies") +
theme(axis.title.x = element_text(color="Black",size=10),
axis.title.y = element_text(color="Black",size=10))
Compared to audience ratings, critic ratings are more uniformly distributed.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
movies %>%
ggplot(aes(x=Year_of_release))+
geom_bar(aes(fill=Genre),color="black",position="dodge")+
xlab("Year of Release")+ylab("No of movies")
From this bar chart we can say that the top 3 movie genres by volume are Comedy, Action, and Drama.
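The ranking can also be read from the counts directly. The sketch below re-enters the genre counts reported by summary(movies) above; with the data frame loaded, table(movies$Genre) would give the same numbers. The names genre_counts and top3 are illustrative.

```r
# Genre counts as reported in the summary(movies) output
genre_counts <- c(Action = 154, Adventure = 29, Comedy = 172, Drama = 101,
                  Horror = 49, Romance = 21, Thriller = 36)

# Sort from most to least frequent and keep the first three
top3 <- head(sort(genre_counts, decreasing = TRUE), 3)
names(top3)

# Share of all films that belong to each of the top three genres
round(100 * top3 / sum(genre_counts), 1)
```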
aud_vs_crit<-ggplot(data=movies,aes(y=Audience_rating,x=Critic_rating,color=Genre))
aud_vs_crit + geom_point(alpha=0.5)+geom_smooth(fill=NA)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The plot is a combination of geom_point(), which gives the scatter plot, and geom_smooth(), which adds trend lines. The alpha argument is used to make the points more transparent. The insight we can uncover from this plot is that for Romance movies, when critics give a low rating, it is highly likely that the audience will still give a good rating. For Action and Horror movies, critics and audiences are largely in consensus.
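The agreement between critics and audiences can also be summarised with one correlation per genre. This is a sketch on made-up numbers: the data frame toy and its values are hypothetical and are chosen only to show the mechanics, not to reproduce the real result.

```r
# Hypothetical ratings for two genres, four films each
toy <- data.frame(
  Genre           = rep(c("Action", "Romance"), each = 4),
  Critic_rating   = c(20, 40, 60, 80, 20, 40, 60, 80),
  Audience_rating = c(30, 45, 62, 78, 70, 66, 72, 75)
)

# Split the rows by genre, then compute one correlation per group
by_genre <- split(toy, toy$Genre)
genre_cor <- sapply(by_genre, function(d) {
  cor(d$Critic_rating, d$Audience_rating)
})

genre_cor
```

A genre where the two ratings move together gets a correlation near 1, while a genre where audiences rate films well regardless of the critics gets a lower value.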
Genre_aud_rat<-ggplot(data=movies,aes(x=Genre,y=Audience_rating,color=Genre))
Genre_aud_rat+geom_jitter()+geom_boxplot(size=1.2,alpha=0.6)
This chart is created by combining a boxplot with points; to reduce the overlap of points we used geom_jitter(). From this plot we can see that the movies most likely to succeed in the eyes of the audience are in the Thriller and Romance genres. Drama makes the cut too, but it is quite volatile. The opposite can be said for Horror.
Genre_crit_rat<-ggplot(data=movies,aes(x=Genre,y=Critic_rating,color=Genre))
Genre_crit_rat+geom_jitter()+geom_boxplot(size=1.2,alpha=0.6)
From this plot we can say that critic ratings cannot be easily predicted: the interquartile range is quite large, the median looks high only for Thriller, and Horror has the lowest median. So directors, if you want to impress the critics, try Thriller…
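The two quantities read off these boxplots, the median and the interquartile range, can be computed per genre. The following sketch uses hypothetical ratings (the data frame toy and its values are made up); on the real data the same calls would use movies$Critic_rating or movies$Audience_rating together with movies$Genre.

```r
# Hypothetical ratings for two genres, five films each
toy <- data.frame(
  Genre  = rep(c("Horror", "Thriller"), each = 5),
  Rating = c(10, 25, 35, 50, 80, 55, 60, 65, 70, 75)
)

# Median per genre: the thick line inside each box
genre_median <- tapply(toy$Rating, toy$Genre, median)

# Interquartile range per genre: the height of each box
genre_iqr <- tapply(toy$Rating, toy$Genre, IQR)

data.frame(median = genre_median, iqr = genre_iqr)
```

A high median with a small interquartile range points to a genre that is rated well and consistently, while a large interquartile range corresponds to what the text calls a volatile genre.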
movies_new<-read.csv("New_Movies_ds.csv")
colnames(movies_new)<-c("Day","Director","Genre","Movie_Name","Release_date","studio",
                        "Adjusted_gross_mln","Budget_mln","Gross_mln","IMDb_rating",
                        "Movielens_rating","Overseas_collection_mln","Overseas_Percentage",
                        "Profit_mln","Profit_Percentage","Runtime_mins","US_collection",
                        "Gross_percent_US")
Here we plot the number of movies released on each day of the week.
ggplot(data=movies_new,aes(x=Day))+geom_bar(aes(fill=Day))
Movies are mostly released on Fridays.
fil_genre<-movies_new$Genre %in% c("action","animation","adventure","comedy","drama","thriller")
fil_studio<-movies_new$studio %in% c("Buena Vista Studios","Fox","Paramount Pictures","WB","Sony","Universal")
req_mov<-movies_new[fil_genre & fil_studio,]
Here the new data frame req_mov contains only the movies that pass both filters, the one on Genre and the one on studio.
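How two logical filters combine can be seen on a handful of rows. This is a sketch on a hypothetical data frame: the movie names and the rows in toy are made up, and only the column names follow movies_new.

```r
# Hypothetical rows; only the column names follow movies_new
toy <- data.frame(
  Movie_Name = c("A", "B", "C", "D"),
  Genre      = c("action", "horror", "comedy", "drama"),
  studio     = c("Fox", "Fox", "Lionsgate", "WB"),
  stringsAsFactors = FALSE
)

# Each filter is a logical vector with one TRUE or FALSE per row
keep_genre  <- toy$Genre %in% c("action", "comedy", "drama")
keep_studio <- toy$studio %in% c("Fox", "WB")

# A row is kept only when both filters are TRUE for it
subset_toy <- toy[keep_genre & keep_studio, ]
subset_toy$Movie_Name
```

If Genre and studio are stored as factors, calling droplevels() on the filtered data frame removes the levels that no longer occur, so they do not show up in summaries such as table().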
gross_perc_genre<-ggplot(data=req_mov,aes(x=Genre,y=Gross_percent_US))
final_mv_viz<-gross_perc_genre+geom_jitter(alpha=0.7, aes(size=Budget_mln,color=studio))+geom_boxplot(alpha=0.6, outlier.colour = NA)
final_mv_viz+xlab("Genre")+ylab("Gross % US")+ggtitle("Domestic Gross % by Genre")+
theme(axis.title.x = element_text(color="Blue",size=15),
axis.title.y = element_text(color="Blue",size=15),
axis.text.x = element_text(size = 10),
axis.text.y = element_text(size = 10),
legend.text = element_text(size=10),
legend.title=element_text(size=10),
plot.title = element_text(color="Black",size = 20,family = "TT Arial"),
text=element_text(family = "Arial"))
In this code we added layers through the xlab(), ylab(), ggtitle(), and theme() functions. xlab() sets the x-axis title, ylab() does the same for the y axis, and ggtitle() sets the main title. theme() is used to polish the chart by setting the colour and size of the titles, the axis text, and the legend.
The points in the plot are movies released by the studios shown in the legend, and the size of each point reflects the budget spent on that movie. The boxplots show the distribution of the US share of gross collections (%) for each genre. From this we can see that the most stable collections are obtained by the Comedy genre; it seems people love to watch comedy movies. Although the maximum collections come from action movies, the boxplot shows volatility there, meaning it could go either way.