A film is a motion picture that tells a story. Actors typically portray characters within the world or universe in which the story takes place. Because audiences enjoy watching films as entertainment, filmmaking is a very large industry. Films are usually released in theaters, where viewers pay for a ticket. Total Gross is calculated by multiplying Tickets Sold by Average Ticket Price. Because Total Gross is derived directly from Tickets Sold and Average Ticket Price, it is not independent of either variable, so some association between Total Gross and Average Ticket Price is expected by construction. It also makes sense to expect a relationship between Tickets Sold and Average Ticket Price since, from an economic perspective, the number of tickets sold (the quantity demanded) depends on the price of the ticket. The goal of this assignment is to examine two relationships: first, whether there is a relationship between Average Ticket Price and Total Gross, and second, whether there is a relationship between Average Ticket Price and Tickets Sold. Both are examined by fitting linear and/or polynomial regression models and analyzing their summaries.
The data set originated from Box Office Mojo, which is part of IMDb. Box Office Mojo tracks yearly box office information for films from 1977 to 2019, including each year's total gross, gross change, number of releases, average gross per release, and number one release. The data set used here, however, was a cleaned and older version of the online data. It only went to 2017 rather than 2019, and its values for Total Gross, Tickets Sold, and Average Ticket Price differed from those shown online.
The following code was used to get the linear model summary for Average Ticket Price and Total Gross:
Profit <- read.csv("YearlyBoxOffice.csv")
model1<-lm(Total_Gross~Avg_Ticket_Price,Profit, model=TRUE)
summary(model1)
##
## Call:
## lm(formula = Total_Gross ~ Avg_Ticket_Price, data = Profit, model = TRUE)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1242.27  -402.94   -72.24   421.18  1455.06
##
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)       -783.52     317.54  -2.467   0.0185 *
## Avg_Ticket_Price  1460.17      54.92  26.588   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 631.9 on 36 degrees of freedom
## Multiple R-squared: 0.9515, Adjusted R-squared: 0.9502
## F-statistic: 706.9 on 1 and 36 DF, p-value: < 2.2e-16
model1[1]
## $coefficients
##      (Intercept) Avg_Ticket_Price
##        -783.5246        1460.1666
To test whether Average Ticket Price and Total Gross are related, a null and an alternative hypothesis were set up. The null hypothesis is that there is no linear relationship between Average Ticket Price and Total Gross (the slope equals zero). The alternative hypothesis is that there is a relationship between Average Ticket Price and Total Gross (the slope differs from zero).
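As a rough check of how the slope test works, the t value and p-value in the summary can be reproduced by hand from the reported estimate and standard error. The sketch below assumes the rounded values printed in the summary output above, so it agrees with R's own figures only up to rounding. The variable names are illustrative and are not part of the original analysis.

```r
# Values copied from the summary(model1) output (rounded as printed)
estimate  <- 1460.17   # slope for Avg_Ticket_Price
std_error <- 54.92     # standard error of the slope
df_resid  <- 36        # residual degrees of freedom

# t statistic for the null hypothesis that the slope is zero
t_value <- estimate / std_error

# Two-sided p-value from the t distribution with 36 degrees of freedom
p_value <- 2 * pt(-abs(t_value), df = df_resid)

print(round(t_value, 2))
print(p_value < 0.001)
```

The computed t value is about 26.59, which matches the reported 26.588 apart from the rounding of the inputs, and the p-value falls far below 0.001.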
Profit <- read.csv("YearlyBoxOffice.csv")
attach(Profit)
plot(Avg_Ticket_Price, Total_Gross, main="Total Gross vs. Average Ticket Price",
xlab="Average Ticket Price ", ylab="Total Gross ", pch=19)
abline(lm(Total_Gross~Avg_Ticket_Price), col = "black")
FIG 1. This scatter plot shows the relationship between Average Ticket Price and Total Gross. The line of best fit is represented by the black line. The units for Average Ticket Price and Total Gross are $.
regression.line = lm(Total_Gross~Avg_Ticket_Price)
Residuals = resid(regression.line)
predicted = predict(regression.line)
plot(Avg_Ticket_Price, Residuals,
main = "Residual Plot of Total Gross vs. Average Ticket Price")
FIG 1A. This scatter plot shows the residuals for Total Gross vs. Average Ticket Price.
The following code was used to get the polynomial (quadratic) model summary for Average Ticket Price and Tickets Sold:
Profit <- read.csv("YearlyBoxOffice.csv")
model2<-lm(Tickets_Sold~Avg_Ticket_Price+I(Avg_Ticket_Price^2), Profit, model=TRUE)
summary(model2)
##
## Call:
## lm(formula = Tickets_Sold ~ Avg_Ticket_Price + I(Avg_Ticket_Price^2),
## data = Profit, model = TRUE)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -179.50  -62.44   13.76   62.22  156.04
##
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)            114.261    157.198   0.727    0.472
## Avg_Ticket_Price       410.273     58.726   6.986 3.98e-08 ***
## I(Avg_Ticket_Price^2)  -31.944      5.021  -6.362 2.58e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 88.75 on 35 degrees of freedom
## Multiple R-squared: 0.6576, Adjusted R-squared: 0.638
## F-statistic: 33.6 on 2 and 35 DF, p-value: 7.168e-09
model2[1]
## $coefficients
##           (Intercept)      Avg_Ticket_Price I(Avg_Ticket_Price^2)
##             114.26063             410.27341             -31.94367
To test whether Average Ticket Price and Tickets Sold are related, a null and an alternative hypothesis were set up. The null hypothesis is that there is no relationship between Average Ticket Price and Tickets Sold (the coefficients of the linear and squared terms are both zero). The alternative hypothesis is that there is a relationship between Average Ticket Price and Tickets Sold (at least one of these coefficients differs from zero).
Profit <- read.csv("YearlyBoxOffice.csv")
plot(Avg_Ticket_Price,Tickets_Sold, main="Tickets Sold vs. Average Ticket Price",
xlab="Average Ticket Price ", ylab="Tickets Sold ", pch=19)
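fit <- lm(Tickets_Sold ~ Avg_Ticket_Price + I(Avg_Ticket_Price^2))
# Sort by price so the fitted curve is drawn left to right without zigzags
ord <- order(Avg_Ticket_Price)
lines(Avg_Ticket_Price[ord], predict(fit)[ord])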
FIG 2. This scatter plot shows the relationship between Average Ticket Price and Tickets Sold. The fitted quadratic curve is represented by the black line. Average Ticket Price is in $.
regression.line = lm(Tickets_Sold~Avg_Ticket_Price)
Residuals = resid(regression.line)
predicted_2 = predict(regression.line)
plot(Avg_Ticket_Price, Residuals,
main = "Residual Plot of Tickets Sold vs. Average Ticket Price")
FIG 2A. This scatter plot shows the residuals from a linear fit of Tickets Sold vs. Average Ticket Price.
For FIG 1., the R squared was 0.9515. This indicated that Avg_Ticket_Price accounted for 95.15% of the variability in Total_Gross. This R squared was very high and close to 1, which indicated that there was a strong positive relationship between the two variables and that Avg_Ticket_Price predicted Total_Gross well. The p-values for the slope and the intercept were both statistically significant: for the intercept, 0.0185 < 0.05, and for the slope, the p-value was below 2e-16, far less than 0.001. Looking at the residual plot, however, it was noticeable that a linear model was not the best fit for the data. While the residuals were spread out rather than clustered together, they followed a distinct pattern. There appeared to be a sharp turn at an Average Ticket Price of about 5.5, where the residuals went from increasing to decreasing. The residual summary pointed the same way, since the residuals were not symmetric around zero (the median was -72.24, and the minimum and maximum were -1242.27 and 1455.06). Looking at the coefficients, the model said that for every $1 increase in Avg_Ticket_Price, the expected Total_Gross would increase by about 1460.17, in the units of Total_Gross. The line of best fit could be written as Predicted Total Gross = 1460.17(Average Ticket Price) - 783.52. Since the p-value for the slope was lower than its alpha level, the null hypothesis was rejected in favor of the alternative hypothesis. Due to this, it could be inferred that there was a relationship between the two variables: as Avg_Ticket_Price increased, Total_Gross increased too.
For FIG 2., the R squared was 0.6576. This indicated that the quadratic model in Avg_Ticket_Price accounted for 65.76% of the variability in Tickets_Sold. This R squared was moderately high and was above 0.5, which indicated that there was a moderate relationship between the two variables and that Avg_Ticket_Price was a reasonably useful predictor of Tickets_Sold. The p-values for Average Ticket Price and Average Ticket Price squared were both statistically significant: for Average Ticket Price, 3.98e-08 < 0.001, and for Average Ticket Price squared, 2.58e-07 < 0.001. The intercept was not significant (p = 0.472), which does not affect the test of the relationship. Looking at the residuals and the residual plot, it was noticeable that the relationship between the two variables wasn't suited for a linear model, so a quadratic model was used instead. The fitted curve could be written as Predicted Tickets Sold = 410.273(Average Ticket Price) - 31.944(Average Ticket Price)^2 + 114.261. Since the p-values for the linear and squared terms were lower than their alpha levels, the null hypothesis was rejected in favor of the alternative hypothesis. Due to this, it could be inferred that there was a relationship between the two variables. Because the coefficient of the squared term was negative, the model showed that as Avg_Ticket_Price increased, Tickets Sold increased up to a price of about $6.42 (the vertex of the fitted parabola, 410.273 / (2 x 31.944)), after which the predicted number of tickets started to decrease.
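The two fitted equations can be turned into small prediction functions, which makes it easier to see what each model implies at a given price. This is only a sketch: the coefficients are copied from the model1[1] and model2[1] output, while the function names and the example price of 8 are hypothetical choices made for illustration.

```r
# Coefficients copied from the model1[1] and model2[1] output
b_lin  <- c(intercept = -783.5246, slope = 1460.1666)
b_quad <- c(intercept = 114.26063, linear = 410.27341, quadratic = -31.94367)

# Linear model: predicted Total_Gross at a given average ticket price
predict_gross <- function(price) {
  b_lin[["intercept"]] + b_lin[["slope"]] * price
}

# Quadratic model: predicted Tickets_Sold at a given average ticket price
predict_tickets <- function(price) {
  b_quad[["intercept"]] + b_quad[["linear"]] * price +
    b_quad[["quadratic"]] * price^2
}

# Price at which the fitted parabola peaks: -b / (2c)
peak_price <- -b_quad[["linear"]] / (2 * b_quad[["quadratic"]])

example_price <- 8  # hypothetical price, chosen only for illustration
print(predict_gross(example_price))
print(predict_tickets(example_price))
print(round(peak_price, 2))
```

With these coefficients, peak_price comes out at roughly 6.42, so the quadratic model predicts fewer tickets sold at any price above or below that value.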
It can be concluded that there was a strong relationship between Average Ticket Price and Total Gross. However, a high R squared did not by itself mean that a linear model should have been used: the pattern in the residual plot showed that the relationship was not modeled best by a straight line. In addition to this, it was concluded that there was a moderately strong relationship between Average Ticket Price and Tickets Sold, and that this relationship was modeled better by a quadratic model than by a linear one. It should be noted that the residual plots for Tickets Sold vs. Average Ticket Price and for Total Gross vs. Average Ticket Price looked very similar in shape and appearance. This probably occurred because Total Gross is the product of Tickets Sold and Average Ticket Price. Because the two response variables are so closely linked, and because a quadratic worked for Tickets Sold, it is likely that a quadratic would have suited Total Gross better than the linear model as well.
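One way to check whether a quadratic would have suited Total Gross better is to fit both models and compare them with a nested-model F test. The sketch below uses synthetic data generated inside the code, not the box office data, so its numbers say nothing about the real result. The data frame sim and its columns price and gross are hypothetical stand-ins for Profit, Avg_Ticket_Price, and Total_Gross.

```r
set.seed(1)  # reproducible synthetic data; NOT the box office data

# Hypothetical data with mild curvature, for illustration only
price <- seq(2, 9, length.out = 38)
gross <- 500 + 2000 * price - 60 * price^2 + rnorm(38, sd = 300)
sim <- data.frame(price = price, gross = gross)

fit_linear    <- lm(gross ~ price, data = sim)
fit_quadratic <- lm(gross ~ price + I(price^2), data = sim)

# Nested-model F test: does adding the squared term reduce the
# residual sum of squares by more than chance would explain?
comparison <- anova(fit_linear, fit_quadratic)
print(comparison)

# Adjusted R-squared penalizes the extra parameter
adj_r2 <- c(linear    = summary(fit_linear)$adj.r.squared,
            quadratic = summary(fit_quadratic)$adj.r.squared)
print(adj_r2)
```

To apply the same comparison to the real data, sim would be replaced by Profit and the formulas by Total_Gross ~ Avg_Ticket_Price and Total_Gross ~ Avg_Ticket_Price + I(Avg_Ticket_Price^2). A small p-value in the second row of the anova table would support the quadratic model.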
“Domestic Yearly Box Office.” Box Office Mojo, IMDb.com, Inc. or Its Affiliates, www.boxofficemojo.com/year/.