PROJECT DESCRIPTION

Objective:

The objective of this analysis was to examine movies released from 2007 to 2011. The deep dive looks at how audience ratings relate to genre, whether critic ratings and audience ratings have anything in common, and several other insights that could help someone planning a movie understand which genre is likely to be the most profitable and give the movie the best chance of success.

Resources & Reference:

For this I used two structured datasets, Movie_Ratings and Movies_Collections, which were downloaded from a third party, Superdatascience. I am thankful to my instructor, Mr. Kiril.

Datasets clearing the ROCCC parameters:

The datasets I downloaded were reliable, as the data was unbiased. They were original, as they were taken from an original source. They were comprehensive, as they contained the critical information needed to conduct the analysis. They were also cited, which makes my insights credible. They may not be current, but this makes it an analysis of historical data, which can help me in my future analysis.

Data Pre-Processing

Data Cleaning in Excel:

First, the Movie_Ratings dataset was checked for blank cells using Go To [Ctrl+G, then Alt+S], and any that existed were cleared. I then converted the range into a table so that I could run Power Query on it, where I sorted the data and split the "Budget in million $" column of the Movies_Collections dataset. That column was of character type because each cell contained "mln", so I split it on the delimiter ' ', saved the result, and loaded it into a new worksheet. Finally, I converted the table back to a range to load it into R.
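The same split could also be done in R instead of Power Query. This is only a minimal sketch: the vector budget_raw is hypothetical, and it assumes each cell looks like "8 mln", which this write-up does not show directly.

```r
# Hypothetical raw values, assuming each cell looks like "<number> mln"
budget_raw <- c("8 mln", "105 mln", "20 mln")

# Split each value on the space and keep the first piece (the number)
pieces <- strsplit(budget_raw, " ", fixed = TRUE)
budget_mln <- as.numeric(vapply(pieces, `[`, character(1), 1))

budget_mln
```

Doing the split in code keeps the cleaning step reproducible, whereas the Excel route has to be repeated by hand if the source file changes.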

The purpose of cleaning the data and aligning it with the business objectives was to support accurate conclusions.

Setting up my Environment Layer wise

DATA LAYER

Notes: Setting up my R environment by loading the ‘ggplot2’ package and the Movie-Ratings dataset.

movies<-read.csv("Movie-Ratings.csv")
colnames(movies)<- c("Film", "Genre","Critic_rating","Audience_rating","Budget_mln","Year_of_release")
str(movies)
## 'data.frame':    562 obs. of  6 variables:
##  $ Film           : Factor w/ 562 levels "(500) Days of Summer ",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Genre          : Factor w/ 7 levels "Action","Adventure",..: 3 2 1 2 3 1 3 5 3 3 ...
##  $ Critic_rating  : int  87 9 30 93 55 39 40 50 43 93 ...
##  $ Audience_rating: int  81 44 52 84 70 63 71 57 48 93 ...
##  $ Budget_mln     : int  8 105 20 18 20 200 30 32 28 8 ...
##  $ Year_of_release: int  2009 2008 2009 2010 2009 2009 2008 2007 2011 2011 ...
movies$Year_of_release<-factor(movies$Year_of_release)
summary(movies)
##                     Film           Genre     Critic_rating  Audience_rating
##  (500) Days of Summer :  1   Action   :154   Min.   : 0.0   Min.   : 0.00  
##  10,000 B.C.          :  1   Adventure: 29   1st Qu.:25.0   1st Qu.:47.00  
##  12 Rounds            :  1   Comedy   :172   Median :46.0   Median :58.00  
##  127 Hours            :  1   Drama    :101   Mean   :47.4   Mean   :58.83  
##  17 Again             :  1   Horror   : 49   3rd Qu.:70.0   3rd Qu.:72.00  
##  2012                 :  1   Romance  : 21   Max.   :97.0   Max.   :96.00  
##  (Other)              :556   Thriller : 36                                 
##    Budget_mln    Year_of_release
##  Min.   :  0.0   2007: 79       
##  1st Qu.: 20.0   2008:125       
##  Median : 35.0   2009:116       
##  Mean   : 50.1   2010:119       
##  3rd Qu.: 65.0   2011:123       
##  Max.   :300.0                  
## 

The CSV file was imported with read.csv(), which reads a file in table format and creates a data frame from it (here, movies), with cases corresponding to lines and variables to fields in the file. The column names were changed with colnames(). Using str(), I saw that the year column was an int rather than a factor. For my analysis I needed it to be a factor with only 5 levels, so I converted it with factor(). I then used summary() to get some statistical information.
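The effect of the factor() conversion can be seen on a tiny data frame. The data frame toy and its three rows are made up for illustration and are not taken from Movie-Ratings.csv.

```r
# Toy data frame; the values are illustrative only
toy <- data.frame(
  Film = c("A", "B", "C"),
  Year_of_release = c(2009L, 2008L, 2009L)
)

# Before conversion the year is stored as an integer
class(toy$Year_of_release)

# factor() turns it into a categorical variable with one level per distinct year
toy$Year_of_release <- factor(toy$Year_of_release)
levels(toy$Year_of_release)
nlevels(toy$Year_of_release)
```

With the year stored as a factor, summary() reports a count per year instead of a mean and quartiles, which is what the summary output above shows.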

AESTHETICS LAYER & GEOMETRY LAYER

library(ggplot2)
q<-ggplot(data=movies,aes(x=Critic_rating,y=Audience_rating,
                       colour=Genre, size=Budget_mln))

Here the ggplot2 package, which is used for high-end visualizations in R, is loaded. I then created an object q that stores the data together with its aesthetic mappings. aes() is the aesthetic mapping function: it maps variables to visual properties of geoms, such as the colour, shape, and size of what is drawn.
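Mappings set in ggplot() are inherited by every geom added afterwards, and a single layer can override one of them with its own aes() call. A minimal sketch of that behaviour, using R's built-in mtcars data rather than the movies data (the object names base_plot, p1, and p2 are only for illustration):

```r
library(ggplot2)

# Base object: data plus default mappings, no geometry yet
base_plot <- ggplot(data = mtcars,
                    aes(x = wt, y = mpg, colour = factor(cyl)))

# The layer inherits x, y and colour from the base object
p1 <- base_plot + geom_point()

# The layer overrides only x; y and colour are still inherited
p2 <- base_plot + geom_point(aes(x = hp))
```

Printing p1 or p2 draws the plot. Note that overriding x in a layer does not update the axis title, so it usually needs to be set explicitly with xlab().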

INSIGHT 1: Audience ratings vs Budget spent

q+geom_point(aes(x=Budget_mln)) +
 xlab("Budget in million $")+ 
  theme(axis.title.x = element_text(color="Blue",size=15),
        axis.title.y = element_text(color="Blue",size=15))

The plot created is a scatter plot. It shows no clear correlation between the budget spent on a movie and its audience rating.
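A visual impression like this can be cross-checked numerically with cor(). The sketch below uses only the first ten rows visible in the str() output above, so whatever it prints says nothing about the full 562-film dataset; the data frame name first_ten is hypothetical.

```r
# First ten rows as shown in the str() output; not the full dataset
first_ten <- data.frame(
  Budget_mln      = c(8, 105, 20, 18, 20, 200, 30, 32, 28, 8),
  Audience_rating = c(81, 44, 52, 84, 70, 63, 71, 57, 48, 93)
)

# Pearson correlation coefficient: values near 0 suggest no linear
# relationship, values near -1 or 1 suggest a strong one
r <- cor(first_ten$Budget_mln, first_ten$Audience_rating)
r

# On the full data frame the equivalent call would be:
# cor(movies$Budget_mln, movies$Audience_rating)
```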

INSIGHT 2: Audience Ratings

aud_rat_hist<-ggplot(data=movies,aes(x=Audience_rating))
aud_rat_hist+geom_histogram(binwidth = 10,fill="white", color="Blue")+
ylab("Number of movies")+ 
  theme(axis.title.x = element_text(color="black",size=10),
        axis.title.y = element_text(color="black",size=10))

This histogram shows a roughly normal distribution of audience ratings, with most ratings falling between 50 and 70 out of 100. Audience ratings therefore tend to sit above the midpoint of the scale.

INSIGHT 3: Critics Ratings

critic_rating_hist<-ggplot(data=movies)
critic_rating_hist+geom_histogram(binwidth = 10,aes(x=Critic_rating),fill="white", color="Blue")+
  ylab("Number of movies") + 
  theme(axis.title.x = element_text(color="Black",size=10),
        axis.title.y = element_text(color="Black",size=10))

Compared to audience ratings, critic ratings are more uniformly distributed.

INSIGHT 4: Movies released per year

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
movies %>% 
  ggplot(aes(x=Year_of_release))+
geom_bar(aes(fill=Genre),color="black",position="dodge")+
xlab("Year of Release")+ylab("No of movies")

Through this bar chart we can say that the top 3 movie genres in volume are Comedy, Action, Drama.
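The ranking can be confirmed from the genre counts printed by summary(movies) earlier. A small sketch, with the counts copied from that output into a named vector (genre_counts and top3 are illustrative names):

```r
# Genre counts as reported by summary(movies) above
genre_counts <- c(Action = 154, Adventure = 29, Comedy = 172, Drama = 101,
                  Horror = 49, Romance = 21, Thriller = 36)

# Sort from most to fewest movies and keep the first three names
top3 <- names(sort(genre_counts, decreasing = TRUE))[1:3]
top3

# On the full data frame the equivalent would be:
# sort(table(movies$Genre), decreasing = TRUE)
```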

INSIGHT 5: Audience Ratings vs Critic Ratings

aud_vs_crit<-ggplot(data=movies,aes(y=Audience_rating,x=Critic_rating,color=Genre)) 
 aud_vs_crit + geom_point(alpha=0.5)+geom_smooth(fill=NA) 
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The plot combines geom_point(), which gives a scatter plot, with geom_smooth(), which adds a trend line for each genre; the alpha argument is used to increase transparency. The insight we can uncover from this plot is that for Romance movies, when critics give a low rating, it is highly likely that the audience will still give a good rating. For Action and Horror movies, critics and audiences are largely in consensus.

INSIGHT 6: Audience ratings and Genre

Genre_aud_rat<-ggplot(data=movies,aes(x=Genre,y=Audience_rating,color=Genre))
Genre_aud_rat+geom_jitter()+geom_boxplot(size=1.2,alpha=0.6)

This chart combines a boxplot with the individual points; geom_jitter() is used to reduce the overlap between points. From this plot we can see that the genres most likely to succeed in the eyes of the audience are Thriller and Romance. Drama makes the cut too, but it is quite volatile; the opposite can be said for Horror.

INSIGHT 7: Critic ratings and Genre

Genre_crit_rat<-ggplot(data=movies,aes(x=Genre,y=Critic_rating,color=Genre))
Genre_crit_rat+geom_jitter()+geom_boxplot(size=1.2,alpha=0.6)

From this plot we can say that critic ratings cannot be easily predicted: as you can see, the interquartile range is quite large. The median is highest for Thriller and lowest for Horror. So, directors, if you want to impress the critics, try Thriller…
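The medians that the boxplots draw can also be computed directly with tapply(). This sketch uses only the first ten rows from the str() output above, with the genre codes decoded in the alphabetical level order shown by summary(), so it illustrates the call rather than reproducing the full-data medians; first_ten and med_by_genre are hypothetical names.

```r
# First ten rows as shown in the str() output; not the full dataset
first_ten <- data.frame(
  Genre = c("Comedy", "Adventure", "Action", "Adventure", "Comedy",
            "Action", "Comedy", "Horror", "Comedy", "Comedy"),
  Critic_rating = c(87, 9, 30, 93, 55, 39, 40, 50, 43, 93)
)

# Median critic rating within each genre
med_by_genre <- tapply(first_ten$Critic_rating, first_ten$Genre, median)
med_by_genre

# On the full data frame the equivalent would be:
# tapply(movies$Critic_rating, movies$Genre, median)
```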

movies_new<-read.csv("New_Movies_ds.csv")
colnames(movies_new)<-c("Day","Director","Genre","Movie_Name","Release_date","studio","Adjusted_gross_mln","Budget_mln","Gross_mln","IMDb_rating","Movielens_rating","Overseas_collection_mln","Overseas_Percentage","Profit_mln","Profit_Percentage","Runtime_mins","US_collection","Gross_percent_US")

Movies Vs Release_Day

Here we plot the number of movies released on each day of the week.

ggplot(data=movies_new,aes(x=Day))+geom_bar(aes(fill=Day)) 

Most movies are released on Fridays.

Filtered Genre

fil_genre<-movies_new$Genre=="action" |movies_new$Genre=="animation"|movies_new$Genre=="adventure"|movies_new$Genre=="comedy"|movies_new$Genre=="drama"|movies_new$Genre=="thriller"

Filtered Studios

fil_studio<-movies_new$studio %in% c("Buena Vista Studios","Fox","Paramount Pictures","WB","Sony","Universal")

Data Transformation

req_mov<-movies_new[fil_genre & fil_studio,]

The new data frame req_mov contains only the movies that pass both filters, studio and genre.
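How the two logical filters combine can be checked on a small made-up data frame. The four rows below, including the studio "Lionsgate", are hypothetical and chosen only so that some rows fail each filter; the genre filter is written with %in%, which is equivalent to the chain of == comparisons joined by | used above.

```r
# Hypothetical rows for illustration only
toy <- data.frame(
  Movie_Name = c("M1", "M2", "M3", "M4"),
  Genre      = c("action", "romance", "comedy", "drama"),
  studio     = c("Fox", "WB", "Lionsgate", "Sony"),
  stringsAsFactors = FALSE
)

# One logical vector per filter: TRUE where the row should be kept
keep_genre  <- toy$Genre %in% c("action", "animation", "adventure",
                                "comedy", "drama", "thriller")
keep_studio <- toy$studio %in% c("Buena Vista Studios", "Fox",
                                 "Paramount Pictures", "WB", "Sony", "Universal")

# A row survives only if both filters are TRUE
req_toy <- toy[keep_genre & keep_studio, ]
req_toy$Movie_Name
```

Here M2 is dropped by the genre filter and M3 by the studio filter, so only M1 and M4 remain.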

Creating variable for the Viz

gross_perc_genre<-ggplot(data=req_mov,aes(x=Genre,y=Gross_percent_US))
final_mv_viz<-gross_perc_genre+geom_jitter(alpha=0.7, aes(size=Budget_mln,color=studio))+geom_boxplot(alpha=0.6, outlier.colour = NA)

INSIGHT 8: Gross US% Vs Genre

final_mv_viz+xlab("Genre")+ylab("Gross % US")+ggtitle("Domestic Gross % by Genre")+
  
  theme(axis.title.x = element_text(color="Blue",size=15),
        axis.title.y = element_text(color="Blue",size=15),
        axis.text.x = element_text(size = 10),
        axis.text.y = element_text(size = 10),
        legend.text = element_text(size=10),
        legend.title=element_text(size=10), 
        plot.title = element_text(color="Black",size = 20,family = "TT Arial"),
        text=element_text(family = "Arial"))

In this code we added layers through the xlab(), ylab(), ggtitle(), and theme() functions. xlab() sets the x-axis title, ylab() sets the y-axis title, and ggtitle() sets the main title. theme() is used to make the chart pretty by adding colour and adjusting the size of the title, axis text, and legend.

Each point in the plot is a movie released by one of the studios shown, and the size of each point reflects the budget spent on that movie. The boxplot shows the distribution of US gross collections (%) for each genre. From this we can see that Comedy obtains the most stable collections; it seems people love to watch comedy movies. Although the maximum collections belong to Action movies, the boxplot shows volatility there, meaning it could go either way.

Well, that is the end of my analysis. I hope you liked it!