To: Professor Perine

From: Mohamed Nabeel, Campbell Cash, Jiin Kim, Nour Fouladi, Turk Hasnan

Subject: Hollywood Movies Data Set

Background

Our task as a team is to analyze the data set to see which movies were the most profitable, which genres were the most popular, which studios made which genres of movies, and how audience reviews compare across genres. The original data set was pulled from https://www.lock5stat.com/datapage.html. The data set has 16 columns and 970 rows, making it a medium-sized data set. Following up on our last analysis of the data, we analyzed these variables further by comparing the sub-variables within each variable with each other.
Variable(s) | Type | Meaning
Profitability | Int | How much profit a movie made
Genre | Character | Type of movie
Story, Genre | Character, Character | Type of story, type of movie
AudienceScore, Genre | Int, Character | Audience ratings on the movie, type of movie
DomesticGross, ForeignGross | Int, Int | Money made in country, money made globally

Inference Analyses

Analysis 1: One Mean - One Quantitative Variable, Profitability

Analyst: Mohamed Nabeel

library(Rmisc)
## Warning: package 'Rmisc' was built under R version 4.0.3
## Loading required package: lattice
## Loading required package: plyr
## Warning: package 'plyr' was built under R version 4.0.3
df<-read.csv("HollywoodMovies.csv")
clean_Profitability<-df$Profitability[!is.na(df$Profitability)]

#Confidence Interval

CI(clean_Profitability, ci=0.95)
##    upper     mean    lower 
## 426.0362 384.6201 343.2040
t.test(clean_Profitability, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  clean_Profitability
## t = 18.226, df = 895, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  343.2040 426.0362
## sample estimates:
## mean of x 
##  384.6201
histogram(clean_Profitability)

The CLT requirements for a mean are that the observations are independent and that the sample is large enough (commonly n ≥ 30) for the sampling distribution of the mean to be approximately normal. Because this is a random sample, the first requirement is fulfilled, and with 896 non-missing values the second is fulfilled as well. Looking at the confidence interval, we are 95% confident that the mean profitability of a movie is between 343 million and 426 million. One limitation is that I am limited to the data I have from the .csv file, so some of the numbers may be out of date. Another limitation is that I am only looking at the one profitability variable.

qqnorm(clean_Profitability)
qqline(clean_Profitability)

shapiro.test(clean_Profitability)
## 
##  Shapiro-Wilk normality test
## 
## data:  clean_Profitability
## W = 0.38341, p-value < 2.2e-16

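The Shapiro-Wilk test rejects normality (W = 0.38341, p-value < 2.2e-16), and the data are strongly right-skewed. However, the sample size is large (n = 896), so by the Central Limit Theorem the sampling distribution of the mean is still approximately normal. Because sigma is not known, we use a t-distribution for the confidence interval and hypothesis test.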
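As a check on the t-based interval, the 95% bounds can be rebuilt from the printed summary statistics alone. The sketch below assumes only the values shown in the one-sample t-test output (mean 384.6201, t = 18.226, df = 895); the variable names are illustrative and not part of the original analysis.

```r
# Summary statistics copied from the one-sample t-test output
xbar   <- 384.6201  # sample mean
t_stat <- 18.226    # t statistic for H0: mu = 0
df_t   <- 895       # degrees of freedom (n - 1)

# With mu0 = 0, t = xbar / se, so the standard error is xbar / t
se <- xbar / t_stat

# Critical value from the t-distribution for 95% confidence
t_crit <- qt(0.975, df = df_t)

# Confidence interval: mean plus or minus critical value times standard error
ci <- c(lower = xbar - t_crit * se, upper = xbar + t_crit * se)
round(ci, 1)
```

Up to rounding of the printed t statistic, the bounds should agree with the interval reported by CI() and t.test().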

summary(clean_Profitability)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.3   150.0   254.8   384.6   418.0 10175.9
t.test(clean_Profitability, mu = 373.2)
## 
##  One Sample t-test
## 
## data:  clean_Profitability
## t = 0.54117, df = 895, p-value = 0.5885
## alternative hypothesis: true mean is not equal to 373.2
## 95 percent confidence interval:
##  343.2040 426.0362
## sample estimates:
## mean of x 
##  384.6201

#Hypothesis Testing

The average profit from a movie is 373.2 million, retrieved from https://stephenfollows.com/how-movies-make-money-hollywood-blockbusters/#:~:text=Theatrical%20is%20the%20largest%20income,office%20gross%20of%20%24373.2%20million. The data meet the conditions for a one-sample t-test because the observations are independent and the sample is large (n = 896).

H0: mu = 373.2 HA: mu ≠ 373.2

The p-value (0.5885) is greater than the significance level of 0.05, so we fail to reject the null hypothesis. There is not sufficient evidence that the true mean profit of movies differs from 373.2 million. The results of the confidence interval and the hypothesis test corroborate each other because 373.2 falls within the 95% confidence interval, (343.2040, 426.0362).
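The test statistic and p-value for this test can be reproduced by hand. This is a minimal sketch that assumes only the numbers printed in the two t-test outputs (mean 384.6201, first t statistic 18.226, df = 895, null value 373.2); the variable names are illustrative.

```r
xbar <- 384.6201            # sample mean from the output
se   <- 384.6201 / 18.226   # standard error recovered from the first test (t = xbar / se when mu0 = 0)
mu0  <- 373.2               # null value taken from the external source
df_t <- 895                 # degrees of freedom (n - 1)

# One-sample t statistic: distance of the sample mean from mu0 in standard errors
t_stat <- (xbar - mu0) / se

# Two-sided p-value from the t-distribution
p_value <- 2 * pt(-abs(t_stat), df = df_t)

t_stat
p_value
```

A p-value above the chosen significance level means the sample mean is within ordinary sampling variation of the null value.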

Analysis 2: One Proportion - One Categorical Variable, Genre

Analyst: Jiin Kim

My analysis is inference on one proportion for one categorical variable in the Hollywood movies data set: the proportion of movies whose genre is comedy. For my confidence interval, I treated Comedy as the success category and constructed a 90% confidence interval for the parameter after eliminating all missing and NA values. The point estimate is a statistic that is calculated from the sample data and serves as a best guess of an unknown population parameter. Here the population proportion of comedies is the population parameter and the sample proportion (p_hat = 0.2562) is the point estimate. Population parameters are typically unknown because we rarely measure the whole population. Interpreting the 90% confidence interval, we are 90% confident that the true population proportion is between 0.2288 and 0.2835. For 90% confidence, I used Z* = 1.645, the value that bounds the middle 90% of the standard normal distribution. The normal model applies because the Central Limit Theorem conditions are met.
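To show where Z* = 1.645 comes from, the critical value can be taken from the standard normal quantile function instead of being typed in. This sketch assumes the counts reported for this analysis (177 comedies out of 691 movies); the variable names are illustrative.

```r
conf_level <- 0.90
obs <- 177   # number of comedies (successes)
n   <- 691   # movies with a non-missing genre

# Critical value that leaves (1 - conf_level) / 2 in each tail
z_star <- qnorm(1 - (1 - conf_level) / 2)

# Sample proportion and its standard error
p_hat <- obs / n
se    <- sqrt(p_hat * (1 - p_hat) / n)

# Confidence interval for the population proportion
ci <- c(lower = p_hat - z_star * se, upper = p_hat + z_star * se)
round(z_star, 3)
round(ci, 4)
```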

Confidence interval: single proportion, success = Comedy. Summary statistics: p_hat = 0.2562; n = 691. Check conditions: number of successes = 177; number of failures = 514. Standard error = 0.0166.

Hypothesis test: check conditions: number of expected successes = 346; number of expected failures = 346. Standard error = 0.019. Test statistic: Z = -12.82. p-value = 1.

Hypothesis test: H0: p = 0.5, HA: p > 0.5. Since my p-value (1) is greater than alpha, I do not reject H0, and the data do not support HA.

Comparison with an external source: the figure “Percentage of movies by genre, country of production and year of first release. a Genres, b Countries, c Years of first release, grouped by decade” reports that the most common genre is “drama”, followed by “comedy”, with comedy at 34%. In the Hollywood movies data set, comedy accounts for 26% of movies and is the most popular genre. Some movies have NA values or no genre listed. Cleaning out these rows makes it easier to analyze genre with this data set.

https://www.researchgate.net/figure/Percentage-of-movies-by-genre-country-of-production-and-year-of-first-release-a-Genres_fig1_329323623

# Clean Genre.
HollywoodMovies <- subset(df, Genre != "")

# Recode every genre other than "Comedy" as "Other", so that Genre has one
# success level ("Comedy") and one failure level ("Other").
HollywoodMovies$Genre[HollywoodMovies$Genre != "Comedy"] <- "Other"
summary(HollywoodMovies)
##     Movie            LeadStudio        RottenTomatoes  AudienceScore  
##  Length:691         Length:691         Min.   : 0.00   Min.   :19.00  
##  Class :character   Class :character   1st Qu.:27.00   1st Qu.:48.00  
##  Mode  :character   Mode  :character   Median :49.00   Median :60.00  
##                                        Mean   :49.83   Mean   :60.57  
##                                        3rd Qu.:73.00   3rd Qu.:73.00  
##                                        Max.   :99.00   Max.   :96.00  
##                                        NA's   :22      NA's   :22     
##     Story              Genre           TheatersOpenWeek OpeningWeekend   
##  Length:691         Length:691         Min.   :   1     Min.   :  0.032  
##  Class :character   Class :character   1st Qu.:2284     1st Qu.:  6.933  
##  Mode  :character   Mode  :character   Median :2809     Median : 13.600  
##                                        Mean   :2620     Mean   : 21.585  
##                                        3rd Qu.:3286     3rd Qu.: 26.812  
##                                        Max.   :4468     Max.   :174.140  
##                                        NA's   :20       NA's   :1        
##  BOAvgOpenWeekend DomesticGross     ForeignGross       WorldGross     
##  Min.   :  151    Min.   :  0.36   Min.   :   0.01   Min.   :   1.10  
##  1st Qu.: 3767    1st Qu.: 20.73   1st Qu.:  16.41   1st Qu.:  40.45  
##  Median : 5921    Median : 41.80   Median :  44.73   Median :  85.56  
##  Mean   : 8290    Mean   : 69.98   Mean   :  96.76   Mean   : 167.17  
##  3rd Qu.: 9739    3rd Qu.: 90.70   3rd Qu.: 103.14   3rd Qu.: 199.47  
##  Max.   :93230    Max.   :760.50   Max.   :2021.00   Max.   :2781.50  
##  NA's   :24                        NA's   :30        NA's   :20       
##      Budget       Profitability       OpenProfit           Year     
##  Min.   :  0.00   Min.   :   3.68   Min.   :   0.17   Min.   :2007  
##  1st Qu.: 20.00   1st Qu.: 147.02   1st Qu.:  21.25   1st Qu.:2008  
##  Median : 37.00   Median : 249.86   Median :  35.91   Median :2009  
##  Mean   : 55.54   Mean   : 367.96   Mean   :  58.30   Mean   :2009  
##  3rd Qu.: 75.00   3rd Qu.: 400.67   3rd Qu.:  58.91   3rd Qu.:2011  
##  Max.   :300.00   Max.   :6694.40   Max.   :1368.00   Max.   :2013  
##  NA's   :26       NA's   :27        NA's   :28
# Factor.
HollywoodMovies$Genre <- factor(HollywoodMovies$Genre)
# Show inference.
source("http://bit.ly/dasi_inference")
inference(HollywoodMovies$Genre, est = "proportion", type = "ci", conflevel = 0.9, method = "theoretical", 
          success = "Comedy")
## Single proportion -- success: Comedy 
## Summary statistics:

## p_hat = 0.2562 ;  n = 691 
## Check conditions: number of successes = 177 ; number of failures = 514 
## Standard error = 0.0166 
## 90 % Confidence interval = ( 0.2288 , 0.2835 )
# Show hypothesis test.
inference(HollywoodMovies$Genre, est = "proportion", type = "ht", null = 0.5, method = "theoretical", 
          success = "Comedy", alternative = "greater")
## Single proportion -- success: Comedy 
## Summary statistics:
## p_hat = 0.2562 ;  n = 691 
## H0: p = 0.5 
## HA: p > 0.5 
## Check conditions: number of expected successes = 346 ; number of expected failures = 346 
## Standard error = 0.019 
## Test statistic: Z =  -12.82 
## p-value =  1

To meet the success-failure condition, I recoded Genre for inference about the proportion of comedies in the Hollywood movies data set. I kept “Comedy” as the success and renamed all other categories to the failure “Other”, so that the factor has exactly two levels and no level has a count of zero.
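The recoding step can be sketched on a small example. The genre labels below are hypothetical and only illustrate the technique; ifelse() and factor() are base R.

```r
# Hypothetical genre labels, including one blank entry
genre <- c("Comedy", "Action", "", "Drama", "Comedy", "Horror")

# Drop blank genres before recoding
genre <- genre[genre != ""]

# Two-level factor: "Comedy" is the success, everything else is "Other"
genre2 <- factor(ifelse(genre == "Comedy", "Comedy", "Other"),
                 levels = c("Comedy", "Other"))

# Counts of successes and failures
counts <- table(genre2)
counts
```

Recoding before converting to a factor avoids leftover levels with a count of zero.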

# Calculate sample size and p_hat.
n <- 691
obs <- 177
p_hat <- obs/n
p_null <- 0.50
se_null <- sqrt(p_null*(1 - p_null)/n)
z_crit<- 1.645

se <- sqrt(p_hat*(1 - p_hat)/n)
CIleft <- p_hat - z_crit*se
CIright <- p_hat + z_crit*se
CIleft
## [1] 0.2288345
CIright
## [1] 0.2834665
# Calculate a test statistic = z-score of our sample stat, relative to the null
# sampling statistic.
z_test <- (p_hat - p_null)/se_null
z_test
## [1] -12.82008
# Use pnorm with an upper tail to find this probability relative to the standard
# normal distribution
p_value <- pnorm(z_test, 0, 1, lower.tail=FALSE)
p_value
## [1] 1
# Use prop.test to do the same thing.
prop.test(obs, n=n, p=p_null, alternative = "greater")$p.value
## [1] 1

Analysis 3: Two Proportions - G

Analyst: Campbell Cash

For the confidence interval, I used the number 100 as the count for both stories and genres. Because the two sample sizes differ (641 stories and 691 genres after cleaning out all the null values from the ‘Story’ and ‘Genre’ columns), the two proportions do not turn out the same even with equal counts. The lower bound of the interval is negative, which is valid for a difference in proportions, so it is reported as computed. The work in “r confidence-interval” below results in the interval (-3.9%, 6.2%). With 99% confidence, I can say the difference in proportions between stories and genres is between -3.9% and 6.2%.

The data for the confidence interval follow the conditions of the CLT: the sample sizes are less than 10% of the population, and np >= 10 and n(1-p) >= 10 for both samples. A limitation of my analysis is that a confidence interval for these two variables is hard to interpret. It may be saying that up to roughly 6% more stories (not null) are present compared to genres (not null). Another limitation is that I can’t make a graph representing the difference because the populations are different; I would have to use the data frame itself.

Hypothesis test:

H0: p1 - p2 = 0, there is no difference between the proportions of stories and genres (no null values) in the data set

HA: p1 - p2 ≠ 0, there is a difference between the proportions of stories and genres (no null values) in the data set

Observed difference: p1 - p2 = 0.0113

At a 1% significance level, I fail to reject the null hypothesis. The 99% confidence interval computed in the code below, (-0.039, 0.062), contains 0, so the data do not provide sufficient evidence of a difference between the two proportions.
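The decision at the 1% level depends on a test statistic and p-value that the write-up does not show. The following is a minimal sketch of one common way to compute them, assuming the counts and sample sizes used in this analysis (100 of 641 stories, 100 of 691 genres) and a pooled standard error under H0. The variable names are illustrative, and this is not necessarily the calculation the analyst performed.

```r
x1 <- 100; n1 <- 641   # Stories: count and sample size
x2 <- 100; n2 <- 691   # Genres: count and sample size

p1 <- x1 / n1
p2 <- x2 / n2

# Pooled proportion and standard error under H0: p1 - p2 = 0
p_pool  <- (x1 + x2) / (n1 + n2)
se_pool <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

# z statistic and two-sided p-value
z_stat  <- (p1 - p2) / se_pool
p_value <- 2 * pnorm(-abs(z_stat))

# Cross-check with base R's prop.test (no continuity correction)
p_check <- prop.test(c(x1, x2), c(n1, n2), correct = FALSE)$p.value

z_stat
p_value
```

The p-value is then compared with the 0.01 significance level to decide whether to reject H0.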

Another limitation is the same as before with the graphing. I can’t make a graph that represents a difference in proportions because the variables are not numerical. I can make a graph that shows the relationship between stories and genres, like the bar graph below, but it has to come from the original data frame rather than from the cleaned versions of the stories and genres, because their sizes differ.

p1 <- 100/641 #Stories
p2 <- 100/691 #Genres
p <- p1-p2
n1 <- 691
n2 <- 641
se <- sqrt((p1*(1-p1)/n1)+(p2*(1-p2)/n2))
z <- 2.576
p + z*se
## [1] 0.06174409
p - z*se
## [1] -0.03916721
barplot(table(df$Story, df$Genre))

Analysis 4: Two Independent Means - Categorical: Genre - Quantitative: AudienceScore

Analyst: Nour Fouladi

The samples are independent because each movie belongs to only one genre, so the audience scores in one genre group are not paired with the scores in another.

The response variable is AudienceScore and the explanatory variable is Genre. Genre is descriptive in nature, while AudienceScore depends on the viewers’ ratings.

H0: There’s no significant difference in audience ratings based on the genre of the movies

HA: There’s a significant difference in audience ratings based on the genre of the movies

The interval computed at the end of this analysis, (-1.022959, 1.01279), contains 0, so at the 95% confidence level we cannot conclude that there is a difference in audience ratings based on the genre of the movies.

#Graph

library(ggplot2)

ggplot(data = df, mapping = aes(x = Genre, y = AudienceScore)) +
  geom_boxplot() + labs(y = "AudienceScore")
## Warning: Removed 63 rows containing non-finite values (stat_boxplot).

#Numerical summary

by(df$AudienceScore, df$Genre, summary)
## df$Genre: 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   22.00   52.00   63.00   63.24   76.00   92.00      41 
## ------------------------------------------------------------ 
## df$Genre: Action
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    19.0    47.0    58.0    59.3    72.0    93.0       1 
## ------------------------------------------------------------ 
## df$Genre: Adventure
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   31.00   50.25   62.00   63.47   81.00   87.00 
## ------------------------------------------------------------ 
## df$Genre: Animation
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   37.00   54.50   70.00   66.29   76.50   91.00 
## ------------------------------------------------------------ 
## df$Genre: Biography
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   37.00   65.25   72.00   72.00   84.50   93.00 
## ------------------------------------------------------------ 
## df$Genre: Comedy
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   28.00   46.00   56.00   56.39   67.25   93.00       5 
## ------------------------------------------------------------ 
## df$Genre: Crime
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   37.00   47.50   63.00   62.53   75.00   90.00 
## ------------------------------------------------------------ 
## df$Genre: Documentary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    42.0    72.0    78.0    72.2    82.0    87.0       2 
## ------------------------------------------------------------ 
## df$Genre: Drama
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   20.00   53.00   66.00   65.41   81.00   92.00      12 
## ------------------------------------------------------------ 
## df$Genre: Fantasy
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   56.00   63.25   70.50   72.00   79.25   92.00 
## ------------------------------------------------------------ 
## df$Genre: Horror
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   22.00   38.50   49.00   49.27   58.50   83.00       1 
## ------------------------------------------------------------ 
## df$Genre: Musical
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   55.00   74.50   82.00   75.75   83.25   84.00 
## ------------------------------------------------------------ 
## df$Genre: Mystery
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      44      48      50      55      65      68 
## ------------------------------------------------------------ 
## df$Genre: Romance
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   38.00   52.00   68.00   65.68   77.00   84.00       1 
## ------------------------------------------------------------ 
## df$Genre: Thriller
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   29.00   58.00   70.00   66.54   74.50   96.00
by(df$AudienceScore, df$Genre, sd)
## df$Genre: 
## [1] NA
## ------------------------------------------------------------ 
## df$Genre: Action
## [1] NA
## ------------------------------------------------------------ 
## df$Genre: Adventure
## [1] 16.99033
## ------------------------------------------------------------ 
## df$Genre: Animation
## [1] 14.93626
## ------------------------------------------------------------ 
## df$Genre: Biography
## [1] 15.07928
## ------------------------------------------------------------ 
## df$Genre: Comedy
## [1] NA
## ------------------------------------------------------------ 
## df$Genre: Crime
## [1] 16.87715
## ------------------------------------------------------------ 
## df$Genre: Documentary
## [1] NA
## ------------------------------------------------------------ 
## df$Genre: Drama
## [1] NA
## ------------------------------------------------------------ 
## df$Genre: Fantasy
## [1] 13.16055
## ------------------------------------------------------------ 
## df$Genre: Horror
## [1] NA
## ------------------------------------------------------------ 
## df$Genre: Musical
## [1] 13.88944
## ------------------------------------------------------------ 
## df$Genre: Mystery
## [1] 10.77033
## ------------------------------------------------------------ 
## df$Genre: Romance
## [1] NA
## ------------------------------------------------------------ 
## df$Genre: Thriller
## [1] 15.75426
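Several of the standard deviations above are NA because sd() returns NA whenever a group contains a missing score. A minimal sketch of the usual remedy, passing na.rm = TRUE through to sd(), is shown here on hypothetical scores; tapply() is used as a base R alternative to by().

```r
# Hypothetical audience scores with one missing value in the Comedy group
scores <- data.frame(
  Genre = c("Comedy", "Comedy", "Comedy", "Drama", "Drama", "Drama"),
  AudienceScore = c(56, 67, NA, 66, 81, 53)
)

# Default behaviour: a single NA makes the whole group's sd NA
sd_default <- tapply(scores$AudienceScore, scores$Genre, sd)

# Extra arguments are passed on to sd(), so na.rm = TRUE drops missing scores
sd_na_rm <- tapply(scores$AudienceScore, scores$Genre, sd, na.rm = TRUE)

sd_default
sd_na_rm
```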
#95% Confidence interval

p1 <- 10/63.24 #Mean of Genres
p2 <- 10/61.27 #Mean of AudienceScore
p <- p1-p2 #difference of means
n1 <- 63.24
n2 <- 61.27
se <- sqrt((p1*(1-p1))+(p2*(1-p2)))
z <- 1.96
p + z*se
## [1] 1.01279
p - z*se
## [1] -1.022959
#-1.022959 to 1.01279
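For comparison, a confidence interval for the difference between two independent means is usually obtained from a two-sample t-test. The sketch below uses hypothetical scores for two genres (not values from the data set) and base R's t.test(), which applies the Welch unequal-variance method by default.

```r
# Hypothetical audience scores for two genres
scores <- data.frame(
  Genre = rep(c("Comedy", "Drama"), each = 6),
  AudienceScore = c(56, 46, 67, 52, 60, 58, 66, 53, 81, 70, 62, 75)
)

# Welch two-sample t-test: Genre must have exactly two groups
welch <- t.test(AudienceScore ~ Genre, data = scores, conf.level = 0.95)

welch$estimate   # the two group means
welch$conf.int   # 95% interval for the difference in means
welch$p.value    # two-sided p-value
```

With the real data, the same call could be applied to a subset of df containing only the two genres being compared; comparing all genres at once would call for a different method, such as ANOVA.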

Analysis 5: Paired Data Difference of Means

Analyst: Turk Hasnan

My analysis is with the paired data difference of means between how much the movies made domestically versus how much they made in foreign sales. The explanatory variable would be the movie, and the response would be both the domestic gross and the foreign gross. The domestic gross and foreign gross depend on what the movie is. Some movies aren’t available in foreign countries resulting in NA as a value.

df <- read.csv("HollywoodMovies.csv")
library(ggplot2)

dgross <- df$DomesticGross
fgross <- df$ForeignGross

summary(dgross)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.06   17.57   40.41   68.16   89.25  760.50
summary(fgross)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   16.67   46.66  101.24  111.91 2021.00      94

We need to remove the NA’s to see the true mean of the foreign gross and compare the two summaries.

clean_fgross<- fgross[!is.na(fgross)]

summary(clean_fgross)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   16.67   46.66  101.24  111.91 2021.00
summary(fgross)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   16.67   46.66  101.24  111.91 2021.00      94

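Removing the NA’s does not change the summary statistics, because summary() already excludes them, so we will use the uncleaned vectors. This also keeps the domestic and foreign values aligned by movie, which the paired test requires. Compare domestic gross and foreign gross with a side-by-side boxplot.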
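The point about pairing can be sketched on a small example. The values below are hypothetical; complete.cases() is base R and keeps only the positions where both vectors are non-missing, so the two vectors stay aligned.

```r
# Hypothetical domestic and foreign gross values for four movies
dgross_toy <- c(10, 20, 30, 40)
fgross_toy <- c(12, NA, 33, 45)

# Dropping NAs from one vector alone would break the movie-by-movie pairing
length(fgross_toy[!is.na(fgross_toy)])   # 3, while dgross_toy still has 4

# Keep only the movies where both values are present
keep    <- complete.cases(dgross_toy, fgross_toy)
d_clean <- dgross_toy[keep]
f_clean <- fgross_toy[keep]

# The mean is the same whether NAs are removed first or skipped with na.rm
mean(fgross_toy, na.rm = TRUE)
mean(f_clean)
```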

boxplot(dgross, fgross, col="orange", main="Gross Sales", ylab="Sales (millions)", xlab="Domestic/Foreign") 

Perform a paired t-test with a confidence level of 95% to compare the two mean sales figures.

H0: The mean difference between domestic and foreign sales is 0, using a two-sided test.

t.test(dgross, fgross, mu = 0, alt = "two.sided", paired = T, conf.level = 0.95)
## 
##  Paired t-test
## 
## data:  dgross and fgross
## t = -8.2914, df = 875, p-value = 4.193e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -33.00825 -20.37244
## sample estimates:
## mean of the differences 
##               -26.69034

According to the paired t-test, we are 95% confident that the mean of the per-movie differences between foreign sales and domestic sales is between 20.37 and 33.01 million dollars, favoring foreign sales.
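A paired t-test is a one-sample t-test on the per-movie differences, so the interval can be rebuilt from the printed output. This sketch assumes only the values shown above (mean difference -26.69034, t = -8.2914, df = 875); the variable names are illustrative.

```r
mean_diff <- -26.69034   # mean of (domestic - foreign) differences
t_stat    <- -8.2914     # paired t statistic
df_t      <- 875         # degrees of freedom, i.e. 876 complete pairs

# t = mean_diff / se, so the standard error is mean_diff / t
se <- mean_diff / t_stat

# 95% critical value from the t-distribution
t_crit <- qt(0.975, df = df_t)

ci <- c(lower = mean_diff - t_crit * se, upper = mean_diff + t_crit * se)
round(ci, 2)
```

Both bounds are negative, meaning domestic gross is lower than foreign gross on average.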

Hypothesis Test:

H0: mu1 - mu2 = 0 HA: mu1 - mu2 ≠ 0

Since the p-value (4.193e-16) is very close to zero, I reject the null hypothesis and accept the alternative hypothesis that the mean domestic gross and the mean foreign gross of all movies differ. The negative t statistic shows that the difference favors foreign sales.

A limitation was that there was one huge outlier in foreign sales, Avatar, with sales of 2021 million. This could have skewed the results, since it made such a large amount in sales.

Recommendations

One limitation that we faced in our analysis is that we are limited to the data in our csv file, so the numbers are not real time, as people go back and watch movies over and over again. Another limitation is that even though all the NA values were removed, some of the variables are still affected by potential outliers. One statistically significant finding was that foreign gross is higher than domestic gross on average (paired t-test, p-value = 4.193e-16). We rejected the null hypothesis because of the p-value. This is relevant to our data and gives us new insights, because now we can look further into these variables and sub-variables to test more specific data for the next assignment.