To: Professor Perine

From: Mohamed Nabeel, Campbell Cash, Jiin Kim, Nour Fouladi

Subject: Hollywood Movies dataset

Background

Our task as a team is to analyze the dataset to see which movies had the most profit, which genres were the most popular, which studios made which genres of movies, and how audience reviews compare across genres. The original dataset was pulled from https://www.lock5stat.com/datapage.html. The dataset has 16 columns and 970 rows of data, making it a medium-sized dataset.

Variable, Type, Meaning

Profitability, Numeric, How much profit a movie made

Genre, Character, Type of Movie

LeadStudio, Character, Company that made the movie

AudienceScore, Int, Audience rating of the movie

df<-read.csv("HollywoodMovies.csv")
library(ggplot2)

Data Analyses

Analysis 1: One Quantitative Variable

Analyst: Mohamed Nabeel


# Count the NA values in Profitability before cleaning
sum(is.na(df$Profitability))
## [1] 74
# Keep only the non-missing Profitability values
clean_Profitability<-df$Profitability[!is.na(df$Profitability)]
# Count the NA values that remain after cleaning
sum(is.na(clean_Profitability))
## [1] 0

The data was cleaned by finding all the NA values in the Profitability column of the data frame, then using ! to keep only the values that are not NA. I then saved the cleaned values in the clean_Profitability vector. The NA values were clearly removed, because the count of NA values went from 74 to 0.
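
As a small illustration of the cleaning step, the sketch below counts missing values before and after filtering. The vector `profit` is a made-up stand-in for the Profitability column, not data from the file.

```r
# Hypothetical toy vector standing in for df$Profitability
profit <- c(150.0, NA, 254.8, 418.0, NA)

# is.na() flags the missing entries; sum() counts the TRUE flags
n_missing_before <- sum(is.na(profit))

# !is.na() keeps only the entries that are not missing
clean_profit <- profit[!is.na(profit)]
n_missing_after <- sum(is.na(clean_profit))

print(n_missing_before)  # 2
print(n_missing_after)   # 0
```

Note that sum() on the column itself adds up the values, while sum(is.na()) counts how many values are missing.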

summary(clean_Profitability)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.3   150.0   254.8   384.6   418.0 10175.9
sd(clean_Profitability)
## [1] 631.666
hist(clean_Profitability)

The distribution of the data is skewed right. A few movies were far more profitable than the rest, with the highest Profitability value reaching 10175.9. The data is skewed right because the long tail of the distribution is on the right. The mean (384.6) is pulled upward by these high values and sits well above the median (254.8), so we would use the median to represent the data.
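
To show why the median is the safer summary for right-skewed data, here is a minimal sketch with invented numbers (four typical values and one extreme value). The values are assumptions for illustration only.

```r
# Hypothetical values: four typical movies and one extreme outlier
x <- c(150, 200, 250, 400, 10000)

skewed_mean <- mean(x)      # pulled upward by the single large value
skewed_median <- median(x)  # stays at the middle of the typical values

print(skewed_mean)    # 2200
print(skewed_median)  # 250
```

One extreme value moves the mean far from the bulk of the data, while the median does not change.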

Analysis 2: One Categorical Variable

Analyst: Jiin Kim

For the one-categorical-variable analysis of the Hollywood movies dataset, I chose Genre. It is a categorical variable, and its level of measurement is nominal.

I think a bar plot is the best visualization for a single categorical variable, which is Genre in this dataset. The frequency table shows the number of movies that fall in each genre, and the relative frequency table shows the proportion of all movies that fall in each genre. The bar plot of the frequency table shows how often each genre is made. For example, the bars show the counts for genres such as action, biography, drama, musical, and thriller. A bar plot that shows proportions instead of frequencies is a relative frequency bar plot. Since the bar plot can be easily read, the data does not need to be transformed. The blank category is not a genre: it represents movies without a genre listed. Treating those blank entries as missing might help clean the data to better analyze the genres in this dataset.

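#Keeps only the non-NA values in the Genre column. Genre has no NA values
#(missing genres are stored as blank strings), so the column is unchanged.
df$Genre<-df$Genre[!is.na(df$Genre)]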
#Summarize the responses in a frequency table.
table(df$Genre)
## 
##                  Action   Adventure   Animation   Biography      Comedy 
##         279         166          30          51          14         177 
##       Crime Documentary       Drama     Fantasy      Horror     Musical 
##          15           7         109           6          52           4 
##     Mystery     Romance    Thriller 
##           5          20          35
genre <- table(df$Genre)
#Calculate a relative frequency table.
table(df$Genre)/970
## 
##                  Action   Adventure   Animation   Biography      Comedy 
## 0.287628866 0.171134021 0.030927835 0.052577320 0.014432990 0.182474227 
##       Crime Documentary       Drama     Fantasy      Horror     Musical 
## 0.015463918 0.007216495 0.112371134 0.006185567 0.053608247 0.004123711 
##     Mystery     Romance    Thriller 
## 0.005154639 0.020618557 0.036082474
#Use the table object created above.
genre/970
## 
##                  Action   Adventure   Animation   Biography      Comedy 
## 0.287628866 0.171134021 0.030927835 0.052577320 0.014432990 0.182474227 
##       Crime Documentary       Drama     Fantasy      Horror     Musical 
## 0.015463918 0.007216495 0.112371134 0.006185567 0.053608247 0.004123711 
##     Mystery     Romance    Thriller 
## 0.005154639 0.020618557 0.036082474
#Use nrow to automatically calculate.
genre/nrow(df)
## 
##                  Action   Adventure   Animation   Biography      Comedy 
## 0.287628866 0.171134021 0.030927835 0.052577320 0.014432990 0.182474227 
##       Crime Documentary       Drama     Fantasy      Horror     Musical 
## 0.015463918 0.007216495 0.112371134 0.006185567 0.053608247 0.004123711 
##     Mystery     Romance    Thriller 
## 0.005154639 0.020618557 0.036082474
#Create a frequency barplot.
barplot(genre)

#Create a relative frequency barplot.
barplot(genre/nrow(df))
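
The relative frequencies above divide each count by all 970 rows, so the blank category is part of the denominator. As a hedged alternative, the sketch below marks blanks as NA first and then uses prop.table(), which divides by the number of movies that have a genre. The `genres` vector is made up for illustration.

```r
# Hypothetical toy vector standing in for df$Genre
genres <- c("Action", "Comedy", "Comedy", "Drama", "")

# Treat blank strings as missing values
genres[genres == ""] <- NA

# table() leaves out NA by default, so only real genres are counted
counts <- table(genres)

# prop.table() divides each count by the sum of the counts
props <- prop.table(counts)

print(counts)
print(props)
```

With this toy data, Comedy is 2 of the 4 movies that have a genre, so its proportion is 0.5 rather than 2 of 5.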

Analysis 3: Two Categorical Variables

Analyst: Campbell Cash

The two-way table of studios and genres shows the number of movies each studio has made in each genre. Disney, for example, has made more movies in the ‘Animation’ genre than in any other genre.

The bar plot of the table shows how often each movie genre is made. To make the plot less confusing, I used a command that turns blank cells in the Genre column into “NA” so that the bar plot does not include those values. When running the bar plot, the plot window should be stretched out so that all the genre labels under the bars are shown. Genre itself is a nominal variable, but the number of movies made in each genre is a count, so the counts can be subtracted to find their differences. For example, comedy and action are made the most, and we could find the difference between those two counts.

table(df$LeadStudio, df$Genre)
##                           
##                               Action Adventure Animation Biography Comedy Crime
##                             1      4         0         0         0      2     0
##   A24                       3      0         0         0         0      0     0
##   Aardman Animations        0      0         0         1         0      0     0
##   ARC Entertainment         0      0         0         0         0      0     0
##   Atlas Distribution        0      0         0         0         0      0     0
##   Buena Vista              14      3         1         1         0      1     0
##   CBS                       8      1         0         0         0      1     0
##   Cinedigm Entertainment    1      0         0         0         0      0     0
##   Cohen Media               0      0         0         0         0      0     0
##   Columbia                  0      1         0         0         0      0     0
##   Crest                     0      0         0         1         0      0     0
##   Disney                    0     10         2        12         0     11     0
##   DreamWorks                1      2         0         3         0      0     0
##   Entertainment One         0      0         0         0         0      0     0
##   Eros                      4      0         0         0         0      0     0
##   FilmDistrict              6      1         0         0         0      0     0
##   Focus                    10      0         0         1         0      3     1
##   Fox                      26     20         4         5         1     23     0
##   Fox Searchlight          11      0         0         0         0      1     0
##   Happy Madison             0      0         0         0         0      2     0
##   Highlight Communications  0      1         0         0         0      0     0
##   IFC                       5      0         0         0         0      1     0
##   Independent               0     18         4         5         4     39     4
##   LD Entertainment          1      0         0         0         0      0     0
##   Legendary Pictures        0      1         0         0         0      1     0
##   Liberty Starz             0      0         0         0         0      0     0
##   Lionsgate                11     12         1         1         0      5     2
##   Magnolia                  6      0         0         0         0      0     0
##   Mediaplex                 0      1         0         0         0      0     0
##   MGM                       0      1         1         0         0      1     0
##   Millenium Entertainment   1      0         0         0         0      0     0
##   Miramax                   0      0         0         0         0      2     0
##   Morgan Creek              0      0         0         0         0      0     0
##   Music Box Films           0      0         0         0         0      0     0
##   New Line                  0      0         0         0         0      2     0
##   Open Road                 8      0         0         0         0      0     0
##   Oscillloscope             0      0         0         0         0      0     0
##   Overture                  0      0         0         0         0      0     1
##   Paramount                16     14         4         9         0     13     0
##   Pixar                     0      0         0         1         0      0     0
##   Radius-TWC                0      0         0         0         1      0     0
##   Regency Enterprises       0      0         0         0         0      0     0
##   Relativity Media          8      8         0         2         0      7     0
##   Reliance                  0      0         0         0         0      0     0
##   Roadside Attractions      7      0         0         0         0      0     0
##   Rocky Mountain            0      0         0         0         0      0     0
##   Samuel Goldwyn            1      0         0         0         0      1     0
##   Screen Gems               2      1         0         0         0      0     0
##   Sony                     37     19         2         3         3     13     2
##   Spyglass Entertainment    0      0         0         0         0      2     0
##   Summit                    9      5         1         1         0      4     0
##   TriStar                   6      0         0         0         0      0     0
##   Universal                29     17         1         3         3     16     2
##   UTV                       0      0         0         0         0      1     0
##   Vertigo                   0      1         0         0         0      0     0
##   Village Roadshow          0      0         0         1         0      0     0
##   Virgin                    0      0         0         0         0      0     0
##   Warner Bros              28     22         8         1         2     19     3
##   Weinstein                18      2         1         0         0      6     0
##   Yash Raj                  1      1         0         0         0      0     0
##                           
##                            Documentary Drama Fantasy Horror Musical Mystery
##                                      1     0       0      1       0       0
##   A24                                0     0       0      0       0       0
##   Aardman Animations                 0     0       0      0       0       0
##   ARC Entertainment                  1     1       0      0       0       0
##   Atlas Distribution                 0     1       0      0       0       0
##   Buena Vista                        0     0       0      0       0       0
##   CBS                                0     1       0      0       0       0
##   Cinedigm Entertainment             0     0       0      0       0       0
##   Cohen Media                        0     1       0      0       0       0
##   Columbia                           0     1       0      0       0       0
##   Crest                              0     0       0      0       0       0
##   Disney                             0     3       0      0       0       0
##   DreamWorks                         0     1       0      1       0       0
##   Entertainment One                  0     1       0      0       0       0
##   Eros                               0     0       0      0       0       0
##   FilmDistrict                       0     0       0      0       0       0
##   Focus                              0     1       0      0       0       0
##   Fox                                0     6       0      2       0       0
##   Fox Searchlight                    0     1       0      0       0       0
##   Happy Madison                      0     0       0      0       0       0
##   Highlight Communications           0     0       0      0       0       0
##   IFC                                0     0       0      0       0       0
##   Independent                        0    26       2     15       1       0
##   LD Entertainment                   0     1       0      0       0       0
##   Legendary Pictures                 0     0       0      0       0       0
##   Liberty Starz                      0     0       0      1       0       0
##   Lionsgate                          1     4       0      3       0       0
##   Magnolia                           0     0       0      0       0       0
##   Mediaplex                          0     0       0      0       0       0
##   MGM                                0     0       0      0       0       0
##   Millenium Entertainment            0     0       0      0       0       0
##   Miramax                            0     1       0      1       0       0
##   Morgan Creek                       0     0       0      1       0       0
##   Music Box Films                    0     1       0      0       0       0
##   New Line                           0     0       0      1       0       0
##   Open Road                          0     0       0      0       0       0
##   Oscillloscope                      1     0       0      0       0       0
##   Overture                           0     0       0      0       0       0
##   Paramount                          1    14       1      5       0       0
##   Pixar                              0     0       0      0       0       0
##   Radius-TWC                         0     0       0      0       0       0
##   Regency Enterprises                0     0       0      0       0       0
##   Relativity Media                   0     1       0      4       0       0
##   Reliance                           0     2       0      0       0       0
##   Roadside Attractions               0     1       0      0       0       0
##   Rocky Mountain                     0     1       0      0       0       0
##   Samuel Goldwyn                     0     1       0      0       0       0
##   Screen Gems                        0     0       0      0       0       0
##   Sony                               1    11       0      2       0       2
##   Spyglass Entertainment             0     1       0      0       0       0
##   Summit                             0     6       0      2       0       1
##   TriStar                            0     0       0      0       0       0
##   Universal                          0     5       1      2       0       2
##   UTV                                0     0       0      0       0       0
##   Vertigo                            0     0       0      0       0       0
##   Village Roadshow                   0     0       0      0       0       0
##   Virgin                             0     0       0      0       0       0
##   Warner Bros                        0    10       2      7       3       0
##   Weinstein                          1     5       0      4       0       0
##   Yash Raj                           0     0       0      0       0       0
##                           
##                            Romance Thriller
##                                  0        0
##   A24                            0        0
##   Aardman Animations             0        0
##   ARC Entertainment              0        0
##   Atlas Distribution             0        0
##   Buena Vista                    0        0
##   CBS                            1        0
##   Cinedigm Entertainment         0        0
##   Cohen Media                    0        0
##   Columbia                       0        1
##   Crest                          0        0
##   Disney                         0        1
##   DreamWorks                     0        0
##   Entertainment One              0        0
##   Eros                           0        0
##   FilmDistrict                   0        0
##   Focus                          0        1
##   Fox                            1        3
##   Fox Searchlight                0        0
##   Happy Madison                  1        0
##   Highlight Communications       0        0
##   IFC                            0        0
##   Independent                    7        8
##   LD Entertainment               0        0
##   Legendary Pictures             0        0
##   Liberty Starz                  0        1
##   Lionsgate                      0        0
##   Magnolia                       0        0
##   Mediaplex                      0        0
##   MGM                            0        2
##   Millenium Entertainment        0        0
##   Miramax                        0        0
##   Morgan Creek                   0        0
##   Music Box Films                0        0
##   New Line                       0        0
##   Open Road                      0        0
##   Oscillloscope                  0        0
##   Overture                       0        0
##   Paramount                      0        4
##   Pixar                          0        0
##   Radius-TWC                     0        0
##   Regency Enterprises            0        1
##   Relativity Media               0        0
##   Reliance                       0        0
##   Roadside Attractions           0        0
##   Rocky Mountain                 0        0
##   Samuel Goldwyn                 0        0
##   Screen Gems                    0        0
##   Sony                           1        2
##   Spyglass Entertainment         0        0
##   Summit                         2        2
##   TriStar                        0        0
##   Universal                      1        3
##   UTV                            0        0
##   Vertigo                        0        0
##   Village Roadshow               0        0
##   Virgin                         0        1
##   Warner Bros                    4        5
##   Weinstein                      2        0
##   Yash Raj                       0        0
barplot(table(df$LeadStudio, df$Genre))

df$Genre[df$Genre==""]<-NA
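
The blank-to-NA step only affects tables built after it runs. The sketch below, using a made-up data frame called `movies`, shows the conversion placed before the two-way table so that no blank column appears, and adds row proportions to compare genre mixes across studios of different sizes.

```r
# Hypothetical toy data frame standing in for df
movies <- data.frame(
  LeadStudio = c("Disney", "Disney", "Fox", "Fox", "Fox"),
  Genre = c("Animation", "", "Action", "Comedy", "Action"),
  stringsAsFactors = FALSE
)

# Convert blank genres to NA before tabulating
movies$Genre[movies$Genre == ""] <- NA

# Two-way table of studio by genre; rows with NA genre are left out
two_way <- table(movies$LeadStudio, movies$Genre)
print(two_way)

# Row proportions: share of each studio's movies in each genre
row_props <- prop.table(two_way, margin = 1)
print(row_props)
```

In this toy example the table has three genre columns (Action, Animation, Comedy) and no blank column.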

Analysis 4: One Categorical and One Quantitative Variable

Analyst: Nour Fouladi

My categorical and quantitative variables from the HollywoodMovies dataset are Genre and AudienceScore. The boxplot shows how audience ratings differ by genre, which suggests that certain genres are more popular with audiences than others. The reordered boxplot ranks the genres from the lowest to the highest median audience score.

Calculating summary stats for all columns of the dataset

summary(df)
##     Movie            LeadStudio        RottenTomatoes  AudienceScore  
##  Length:970         Length:970         Min.   : 0.00   Min.   :19.00  
##  Class :character   Class :character   1st Qu.:28.00   1st Qu.:49.00  
##  Mode  :character   Mode  :character   Median :52.00   Median :61.00  
##                                        Mean   :51.71   Mean   :61.27  
##                                        3rd Qu.:75.00   3rd Qu.:74.00  
##                                        Max.   :99.00   Max.   :96.00  
##                                        NA's   :57      NA's   :63     
##     Story              Genre           TheatersOpenWeek OpeningWeekend  
##  Length:970         Length:970         Min.   :   1     Min.   :  0.01  
##  Class :character   Class :character   1st Qu.:2054     1st Qu.:  5.30  
##  Mode  :character   Mode  :character   Median :2798     Median : 13.15  
##                                        Mean   :2495     Mean   : 20.62  
##                                        3rd Qu.:3285     3rd Qu.: 26.20  
##                                        Max.   :4468     Max.   :207.44  
##                                        NA's   :21       NA's   :1       
##  BOAvgOpenWeekend DomesticGross     ForeignGross       WorldGross     
##  Min.   :    28   Min.   :  0.06   Min.   :   0.00   Min.   :   0.10  
##  1st Qu.:  3528   1st Qu.: 17.57   1st Qu.:  16.67   1st Qu.:  38.36  
##  Median :  5983   Median : 40.41   Median :  46.66   Median :  88.18  
##  Mean   :  8563   Mean   : 68.16   Mean   : 101.24   Mean   : 169.01  
##  3rd Qu.:  9790   3rd Qu.: 89.25   3rd Qu.: 111.91   3rd Qu.: 202.31  
##  Max.   :147262   Max.   :760.50   Max.   :2021.00   Max.   :2781.50  
##  NA's   :25                        NA's   :94        NA's   :56       
##      Budget       Profitability       OpenProfit           Year     
##  Min.   :  0.00   Min.   :    2.3   Min.   :   0.16   Min.   :2007  
##  1st Qu.: 20.00   1st Qu.:  150.0   1st Qu.:  19.50   1st Qu.:2009  
##  Median : 35.00   Median :  254.8   Median :  34.61   Median :2010  
##  Mean   : 56.12   Mean   :  384.6   Mean   :  62.22   Mean   :2010  
##  3rd Qu.: 75.00   3rd Qu.:  418.0   3rd Qu.:  58.38   3rd Qu.:2012  
##  Max.   :300.00   Max.   :10175.9   Max.   :3373.00   Max.   :2013  
##  NA's   :73       NA's   :74        NA's   :75

Calculating summary stats for an individual column

summary(df$AudienceScore)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   19.00   49.00   61.00   61.27   74.00   96.00      63

Drawing the side-by-side boxplot

ggplot(data = df, mapping = aes(x = Genre, y = AudienceScore)) +
  geom_boxplot() + labs(y = "AudienceScore")
## Warning: Removed 63 rows containing non-finite values (stat_boxplot).

Drawing a neat, reordered side-by-side boxplot

ggplot(data = df, mapping = aes(x = reorder(Genre, AudienceScore, median, na.rm = TRUE), y = AudienceScore)) + geom_boxplot() + labs(x = "Genre", y = "AudienceScore")
## Warning: Removed 63 rows containing non-finite values (stat_boxplot).
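
The ordering in the reordered boxplot can also be read off as numbers. The sketch below computes the median score per genre with tapply() on a made-up data frame called `scores`; the values are assumptions for illustration, not results from the dataset.

```r
# Hypothetical toy data frame standing in for df
scores <- data.frame(
  Genre = c("Action", "Action", "Drama", "Drama", "Drama"),
  AudienceScore = c(50, 70, 80, NA, 60)
)

# Median AudienceScore within each genre, ignoring missing scores
medians <- tapply(scores$AudienceScore, scores$Genre, median, na.rm = TRUE)

# Sort from lowest to highest median, matching the boxplot order
print(sort(medians))
```

Here Action has a median of 60 and Drama has a median of 70, so Action would appear first in the reordered plot.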

Analysis 5: Two Quantitative Variables

Analyst: [Tourkish Hasnan]

Tourkish did not submit any files, despite the team constantly reaching out to him.

Recommendations

One interesting thing I noticed about the Profitability of the movies is that the outliers were so much more profitable than the rest of the dataset. For example, the most profitable movie had a Profitability of 10176, while the Q3 value is 418, which is a huge gap. This is definitely something to look into further. Another interesting thing I noticed was that Animation was the most frequent genre among Disney's movies (12 movies), even though Comedy (177) and Action (166) were the most frequent genres overall. This is something to look into further as well.
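
The gap between the maximum and Q3 can be checked against the common 1.5 × IQR rule for flagging outliers. The report itself does not use this rule, so this is only a sketch of one way to follow up; the quartiles and maximum are copied from the summary output in Analysis 1.

```r
# Values taken from summary(clean_Profitability) in Analysis 1
q1 <- 150.0
q3 <- 418.0
max_value <- 10175.9

# Interquartile range and the upper fence of the 1.5 * IQR rule
iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr

# A value above the upper fence is flagged as a high outlier
is_outlier <- max_value > upper_fence

print(iqr)          # 268
print(upper_fence)  # 820
print(is_outlier)   # TRUE
```

Under this rule, any movie with Profitability above 820 would be flagged, and the maximum of 10175.9 is far beyond that fence.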