To: Professor Perine
From: Mohamed Nabeel, Campbell Cash, Jiin Kim, Nour Fouladi
Subject: Hollywood Movies dataset
Our task as a team is to analyze the dataset to see which movies were the most profitable, which genres were the most popular, which studios made which genres of movies, and how audience scores compare across genres. The original dataset was pulled from https://www.lock5stat.com/datapage.html. The dataset has 16 columns and 970 rows, making it a medium-sized dataset.
Variable, Type, Meaning
Profitability, Numeric, How much profit a movie made
Genre, Character, Type of Movie
LeadStudio, Character, Company that made the movie
AudienceScore, Int, Audience ratings on the movie
df<-read.csv("HollywoodMovies.csv")
library(ggplot2)
sum(df$Profitability,na.rm=TRUE)
## [1] 344619.6
clean_Profitability<-df$Profitability[!is.na(df$Profitability)]
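sum(is.na(clean_Profitability))
## [1] 0
The data was cleaned by using is.na() to find the NA values in the Profitability column of the data frame and ! to keep only the non-missing values. I then saved the cleaned values in the clean_Profitability vector. The first sum() above adds up the non-missing Profitability values (344,619.6); it is not a count of NA values. Counting the NA values with is.na() shows that none remain in clean_Profitability, so the 74 missing values in the original column were removed.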
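As a minimal sketch of this kind of check, the missing values can be counted before and after filtering with sum(is.na(...)). The vector below is a made-up stand-in for df$Profitability, and the names profit and clean_profit are hypothetical.

```r
# Hypothetical stand-in for df$Profitability; values are made up for illustration
profit <- c(120.5, NA, 300.2, 45.0, NA)

# Count the missing values before cleaning
n_missing_before <- sum(is.na(profit))

# Keep only the non-missing values
clean_profit <- profit[!is.na(profit)]

# Count the missing values after cleaning
n_missing_after <- sum(is.na(clean_profit))

print(c(before = n_missing_before, after = n_missing_after))
```

Note that sum(x, na.rm = TRUE) adds up the values of x, while sum(is.na(x)) counts how many of them are missing.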
summary(clean_Profitability)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.3   150.0   254.8   384.6   418.0 10175.9 
sd(clean_Profitability)
## [1] 631.666
hist(clean_Profitability)
The distribution of Profitability is skewed right: most movies fall at the low end, and a long tail extends to the right toward a few movies with very large values, the largest being about 10,176. Because the mean (384.6) is pulled upward by these extreme values, the median (254.8) is the better measure of a typical movie.
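One rough way to see how extreme the largest value is, assuming the common 1.5 * IQR convention for flagging outliers (a convention chosen for this sketch, not something computed elsewhere in this report), is to work from the quartiles that summary() reported:

```r
# Quartiles and maximum as reported by summary(clean_Profitability) in this report
q1 <- 150.0
q3 <- 418.0
max_value <- 10175.9

iqr <- q3 - q1                  # interquartile range
upper_fence <- q3 + 1.5 * iqr   # conventional cutoff for high outliers

print(iqr)
print(upper_fence)
print(max_value > upper_fence)  # is the largest value beyond the cutoff?
```

Under that convention the cutoff sits well below the maximum, which is consistent with the long right tail seen in the histogram.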
The one categorical variable I analyzed in this Hollywood movie dataset is Genre. It is a categorical variable, and its level of measurement is nominal.
I think a bar chart is the best visualization for a single categorical variable, which is Genre in this dataset. The frequency table shows the number of movies that fall in each genre, and the relative frequency table shows each genre's proportion of all 970 movies. The bar plot of the frequency table shows how often each genre is made; the tallest labeled bars are Comedy (177), Action (166), and Drama (109). A bar plot of the relative frequency table shows proportions instead of counts. Since the bar plot can be read easily, the data does not need to be transformed. The unlabeled first category (279 movies) is not an outlier; it holds the movies whose Genre is blank. Converting those blanks to NA would clean the data and make the genres easier to analyze.
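A minimal sketch of that cleanup is shown here on a made-up genre vector (genre_raw and genre_clean are hypothetical names). It turns blank strings into NA so that table() leaves them out, then uses prop.table() to compute relative frequencies. Unlike dividing by 970, this divides by the number of movies that have a genre.

```r
# Hypothetical genre column; "" marks a movie with no recorded genre
genre_raw <- c("Action", "Comedy", "", "Action", "Drama", "", "Comedy", "Action")

# Treat blank strings as missing so table() leaves them out by default
genre_clean <- genre_raw
genre_clean[genre_clean == ""] <- NA

freq <- table(genre_clean)     # counts per genre, NA excluded
rel_freq <- prop.table(freq)   # proportions among movies that have a genre

print(freq)
print(round(rel_freq, 3))
```

The same two objects can be passed to barplot() to draw the frequency and relative frequency plots without a blank bar.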
#Attempts to remove NA values in the Genre column; blank genres are empty strings, not NA, so they still appear in the table below
df$Genre<-df$Genre[!is.na(df$Genre)]
#Summarize the responses in a frequency table.
table(df$Genre)
## 
##                  Action   Adventure   Animation   Biography      Comedy 
##         279         166          30          51          14         177 
##       Crime Documentary       Drama     Fantasy      Horror     Musical 
##          15           7         109           6          52           4 
##     Mystery     Romance    Thriller 
##           5          20          35 
genre <- table(df$Genre)
#Calculate a relative frequency table.
table(df$Genre)/970
## 
##                  Action   Adventure   Animation   Biography      Comedy 
## 0.287628866 0.171134021 0.030927835 0.052577320 0.014432990 0.182474227 
##       Crime Documentary       Drama     Fantasy      Horror     Musical 
## 0.015463918 0.007216495 0.112371134 0.006185567 0.053608247 0.004123711 
##     Mystery     Romance    Thriller 
## 0.005154639 0.020618557 0.036082474 
#Use the table object created above.
genre/970
## 
##                  Action   Adventure   Animation   Biography      Comedy 
## 0.287628866 0.171134021 0.030927835 0.052577320 0.014432990 0.182474227 
##       Crime Documentary       Drama     Fantasy      Horror     Musical 
## 0.015463918 0.007216495 0.112371134 0.006185567 0.053608247 0.004123711 
##     Mystery     Romance    Thriller 
## 0.005154639 0.020618557 0.036082474 
#Use nrow to automatically calculate.
genre/nrow(df)
## 
##                  Action   Adventure   Animation   Biography      Comedy 
## 0.287628866 0.171134021 0.030927835 0.052577320 0.014432990 0.182474227 
##       Crime Documentary       Drama     Fantasy      Horror     Musical 
## 0.015463918 0.007216495 0.112371134 0.006185567 0.053608247 0.004123711 
##     Mystery     Romance    Thriller 
## 0.005154639 0.020618557 0.036082474 
#Create a frequency barplot.
barplot(genre)
#Create a relative frequency barplot.
barplot(genre/nrow(df))
The two-way table of studios and genres shows the number of movies each studio has made in each genre. Disney, for example, has made more movies in the ‘Animation’ genre (12) than in any other genre.
The bar plot of the table shows how often a movie genre is made. To make it less confusing, I used a command that turns blank cells in the Genre column into “NA” so that the bar plot does not include them. When running the bar plot, the plot window should be stretched out so that the labels for all genres are shown. Genre itself is a nominal variable, but the number of movies in each genre is a count, so the counts can be subtracted to find their differences. For example, Comedy (177) and Action (166) are made the most, and the difference between them is 11 movies.
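To illustrate how a studio's most common genre can be read off a two-way table, here is a minimal sketch on made-up data. The data frame movies and the studio names StudioA and StudioB are hypothetical and do not come from the dataset.

```r
# Hypothetical studio/genre pairs, made up for illustration
movies <- data.frame(
  LeadStudio = c("StudioA", "StudioA", "StudioA", "StudioB", "StudioB", "StudioB"),
  Genre      = c("Animation", "Animation", "Comedy", "Action", "Action", "Drama")
)

# Rows are studios, columns are genres
two_way <- table(movies$LeadStudio, movies$Genre)
print(two_way)

# For each studio (row), find the column position of the largest count,
# then look up the matching genre name
top_genre <- colnames(two_way)[apply(two_way, 1, which.max)]
names(top_genre) <- rownames(two_way)
print(top_genre)
```

which.max() returns the first maximum, so a studio with a tie between two genres would be reported under whichever genre comes first alphabetically.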
table(df$LeadStudio, df$Genre)
##
## (blank) Action Adventure Animation Biography Comedy Crime
## (blank) 1 4 0 0 0 2 0
## A24 3 0 0 0 0 0 0
## Aardman Animations 0 0 0 1 0 0 0
## ARC Entertainment 0 0 0 0 0 0 0
## Atlas Distribution 0 0 0 0 0 0 0
## Buena Vista 14 3 1 1 0 1 0
## CBS 8 1 0 0 0 1 0
## Cinedigm Entertainment 1 0 0 0 0 0 0
## Cohen Media 0 0 0 0 0 0 0
## Columbia 0 1 0 0 0 0 0
## Crest 0 0 0 1 0 0 0
## Disney 0 10 2 12 0 11 0
## DreamWorks 1 2 0 3 0 0 0
## Entertainment One 0 0 0 0 0 0 0
## Eros 4 0 0 0 0 0 0
## FilmDistrict 6 1 0 0 0 0 0
## Focus 10 0 0 1 0 3 1
## Fox 26 20 4 5 1 23 0
## Fox Searchlight 11 0 0 0 0 1 0
## Happy Madison 0 0 0 0 0 2 0
## Highlight Communications 0 1 0 0 0 0 0
## IFC 5 0 0 0 0 1 0
## Independent 0 18 4 5 4 39 4
## LD Entertainment 1 0 0 0 0 0 0
## Legendary Pictures 0 1 0 0 0 1 0
## Liberty Starz 0 0 0 0 0 0 0
## Lionsgate 11 12 1 1 0 5 2
## Magnolia 6 0 0 0 0 0 0
## Mediaplex 0 1 0 0 0 0 0
## MGM 0 1 1 0 0 1 0
## Millenium Entertainment 1 0 0 0 0 0 0
## Miramax 0 0 0 0 0 2 0
## Morgan Creek 0 0 0 0 0 0 0
## Music Box Films 0 0 0 0 0 0 0
## New Line 0 0 0 0 0 2 0
## Open Road 8 0 0 0 0 0 0
## Oscillloscope 0 0 0 0 0 0 0
## Overture 0 0 0 0 0 0 1
## Paramount 16 14 4 9 0 13 0
## Pixar 0 0 0 1 0 0 0
## Radius-TWC 0 0 0 0 1 0 0
## Regency Enterprises 0 0 0 0 0 0 0
## Relativity Media 8 8 0 2 0 7 0
## Reliance 0 0 0 0 0 0 0
## Roadside Attractions 7 0 0 0 0 0 0
## Rocky Mountain 0 0 0 0 0 0 0
## Samuel Goldwyn 1 0 0 0 0 1 0
## Screen Gems 2 1 0 0 0 0 0
## Sony 37 19 2 3 3 13 2
## Spyglass Entertainment 0 0 0 0 0 2 0
## Summit 9 5 1 1 0 4 0
## TriStar 6 0 0 0 0 0 0
## Universal 29 17 1 3 3 16 2
## UTV 0 0 0 0 0 1 0
## Vertigo 0 1 0 0 0 0 0
## Village Roadshow 0 0 0 1 0 0 0
## Virgin 0 0 0 0 0 0 0
## Warner Bros 28 22 8 1 2 19 3
## Weinstein 18 2 1 0 0 6 0
## Yash Raj 1 1 0 0 0 0 0
##
## Documentary Drama Fantasy Horror Musical Mystery
## (blank) 1 0 0 1 0 0
## A24 0 0 0 0 0 0
## Aardman Animations 0 0 0 0 0 0
## ARC Entertainment 1 1 0 0 0 0
## Atlas Distribution 0 1 0 0 0 0
## Buena Vista 0 0 0 0 0 0
## CBS 0 1 0 0 0 0
## Cinedigm Entertainment 0 0 0 0 0 0
## Cohen Media 0 1 0 0 0 0
## Columbia 0 1 0 0 0 0
## Crest 0 0 0 0 0 0
## Disney 0 3 0 0 0 0
## DreamWorks 0 1 0 1 0 0
## Entertainment One 0 1 0 0 0 0
## Eros 0 0 0 0 0 0
## FilmDistrict 0 0 0 0 0 0
## Focus 0 1 0 0 0 0
## Fox 0 6 0 2 0 0
## Fox Searchlight 0 1 0 0 0 0
## Happy Madison 0 0 0 0 0 0
## Highlight Communications 0 0 0 0 0 0
## IFC 0 0 0 0 0 0
## Independent 0 26 2 15 1 0
## LD Entertainment 0 1 0 0 0 0
## Legendary Pictures 0 0 0 0 0 0
## Liberty Starz 0 0 0 1 0 0
## Lionsgate 1 4 0 3 0 0
## Magnolia 0 0 0 0 0 0
## Mediaplex 0 0 0 0 0 0
## MGM 0 0 0 0 0 0
## Millenium Entertainment 0 0 0 0 0 0
## Miramax 0 1 0 1 0 0
## Morgan Creek 0 0 0 1 0 0
## Music Box Films 0 1 0 0 0 0
## New Line 0 0 0 1 0 0
## Open Road 0 0 0 0 0 0
## Oscillloscope 1 0 0 0 0 0
## Overture 0 0 0 0 0 0
## Paramount 1 14 1 5 0 0
## Pixar 0 0 0 0 0 0
## Radius-TWC 0 0 0 0 0 0
## Regency Enterprises 0 0 0 0 0 0
## Relativity Media 0 1 0 4 0 0
## Reliance 0 2 0 0 0 0
## Roadside Attractions 0 1 0 0 0 0
## Rocky Mountain 0 1 0 0 0 0
## Samuel Goldwyn 0 1 0 0 0 0
## Screen Gems 0 0 0 0 0 0
## Sony 1 11 0 2 0 2
## Spyglass Entertainment 0 1 0 0 0 0
## Summit 0 6 0 2 0 1
## TriStar 0 0 0 0 0 0
## Universal 0 5 1 2 0 2
## UTV 0 0 0 0 0 0
## Vertigo 0 0 0 0 0 0
## Village Roadshow 0 0 0 0 0 0
## Virgin 0 0 0 0 0 0
## Warner Bros 0 10 2 7 3 0
## Weinstein 1 5 0 4 0 0
## Yash Raj 0 0 0 0 0 0
##
## Romance Thriller
## (blank) 0 0
## A24 0 0
## Aardman Animations 0 0
## ARC Entertainment 0 0
## Atlas Distribution 0 0
## Buena Vista 0 0
## CBS 1 0
## Cinedigm Entertainment 0 0
## Cohen Media 0 0
## Columbia 0 1
## Crest 0 0
## Disney 0 1
## DreamWorks 0 0
## Entertainment One 0 0
## Eros 0 0
## FilmDistrict 0 0
## Focus 0 1
## Fox 1 3
## Fox Searchlight 0 0
## Happy Madison 1 0
## Highlight Communications 0 0
## IFC 0 0
## Independent 7 8
## LD Entertainment 0 0
## Legendary Pictures 0 0
## Liberty Starz 0 1
## Lionsgate 0 0
## Magnolia 0 0
## Mediaplex 0 0
## MGM 0 2
## Millenium Entertainment 0 0
## Miramax 0 0
## Morgan Creek 0 0
## Music Box Films 0 0
## New Line 0 0
## Open Road 0 0
## Oscillloscope 0 0
## Overture 0 0
## Paramount 0 4
## Pixar 0 0
## Radius-TWC 0 0
## Regency Enterprises 0 1
## Relativity Media 0 0
## Reliance 0 0
## Roadside Attractions 0 0
## Rocky Mountain 0 0
## Samuel Goldwyn 0 0
## Screen Gems 0 0
## Sony 1 2
## Spyglass Entertainment 0 0
## Summit 2 2
## TriStar 0 0
## Universal 1 3
## UTV 0 0
## Vertigo 0 0
## Village Roadshow 0 0
## Virgin 0 1
## Warner Bros 4 5
## Weinstein 2 0
## Yash Raj 0 0
df$Genre[df$Genre==""]<-NA
barplot(table(df$LeadStudio, df$Genre))
…
My categorical and quantitative variables from the HollywoodMovies dataset are Genre and AudienceScore. The side-by-side boxplot shows how audience scores differ by genre, which suggests that some genres are rated more highly than others. Reordering the boxes by median makes it easy to see which genres receive the lowest and highest audience scores.
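The numbers behind such a comparison can be sketched with tapply(), which computes a statistic per group. The scores below are made up for illustration, and the data frame name scores is hypothetical.

```r
# Hypothetical audience scores, made up for illustration
scores <- data.frame(
  Genre = c("Action", "Action", "Action", "Drama", "Drama", "Horror", "Horror", "Horror"),
  AudienceScore = c(60, 70, NA, 80, 90, 40, 50, 45)
)

# Median audience score per genre, ignoring missing scores
median_by_genre <- tapply(scores$AudienceScore, scores$Genre, median, na.rm = TRUE)

# Sort from lowest to highest median, the order a median-reordered boxplot uses
print(sort(median_by_genre))
```

Sorting the medians this way gives the same left-to-right ordering of genres that reordering by median produces in a boxplot.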
Calculating summary stats for all columns of the dataset
summary(df)
## Movie LeadStudio RottenTomatoes AudienceScore
## Length:970 Length:970 Min. : 0.00 Min. :19.00
## Class :character Class :character 1st Qu.:28.00 1st Qu.:49.00
## Mode :character Mode :character Median :52.00 Median :61.00
## Mean :51.71 Mean :61.27
## 3rd Qu.:75.00 3rd Qu.:74.00
## Max. :99.00 Max. :96.00
## NA's :57 NA's :63
## Story Genre TheatersOpenWeek OpeningWeekend
## Length:970 Length:970 Min. : 1 Min. : 0.01
## Class :character Class :character 1st Qu.:2054 1st Qu.: 5.30
## Mode :character Mode :character Median :2798 Median : 13.15
## Mean :2495 Mean : 20.62
## 3rd Qu.:3285 3rd Qu.: 26.20
## Max. :4468 Max. :207.44
## NA's :21 NA's :1
## BOAvgOpenWeekend DomesticGross ForeignGross WorldGross
## Min. : 28 Min. : 0.06 Min. : 0.00 Min. : 0.10
## 1st Qu.: 3528 1st Qu.: 17.57 1st Qu.: 16.67 1st Qu.: 38.36
## Median : 5983 Median : 40.41 Median : 46.66 Median : 88.18
## Mean : 8563 Mean : 68.16 Mean : 101.24 Mean : 169.01
## 3rd Qu.: 9790 3rd Qu.: 89.25 3rd Qu.: 111.91 3rd Qu.: 202.31
## Max. :147262 Max. :760.50 Max. :2021.00 Max. :2781.50
## NA's :25 NA's :94 NA's :56
## Budget Profitability OpenProfit Year
## Min. : 0.00 Min. : 2.3 Min. : 0.16 Min. :2007
## 1st Qu.: 20.00 1st Qu.: 150.0 1st Qu.: 19.50 1st Qu.:2009
## Median : 35.00 Median : 254.8 Median : 34.61 Median :2010
## Mean : 56.12 Mean : 384.6 Mean : 62.22 Mean :2010
## 3rd Qu.: 75.00 3rd Qu.: 418.0 3rd Qu.: 58.38 3rd Qu.:2012
## Max. :300.00 Max. :10175.9 Max. :3373.00 Max. :2013
## NA's :73 NA's :74 NA's :75
Calculating summary stats for an individual column
summary(df$AudienceScore)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   19.00   49.00   61.00   61.27   74.00   96.00      63 
Drawing the side-by-side boxplot
ggplot(data = df, mapping = aes(x = Genre, y = AudienceScore)) +
geom_boxplot() + labs(y = "AudienceScore")
## Warning: Removed 63 rows containing non-finite values (stat_boxplot).
Drawing a neat, reordered side-by-side boxplot
ggplot(data = df, mapping = aes(x = reorder(Genre, AudienceScore, median, na.rm = TRUE), y = AudienceScore)) + geom_boxplot() + labs(x = "Genre", y = "AudienceScore")
## Warning: Removed 63 rows containing non-finite values (stat_boxplot).
Tourkish did not submit any files, despite the team constantly reaching out to him.
One interesting thing I noticed about the Profitability of the movies is that the outliers have values far higher than the rest of the dataset. For example, the highest value is about 10,176, while the Q3 value is 418, which is a huge gap. This is definitely something to look into further. Another interesting thing I noticed was that, among movies with a recorded genre, Comedy (177) and Action (166) are the most frequent, while Disney's movies are most often Animation (12). This is something to look into further as well.