A linear model relating the production cost of the top 500 movies to
their worldwide gross sales, with production cost as the response.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
movies <- read.csv("https://raw.githubusercontent.com/johnnyboy1287/hw11Movie/main/top-500-movies.csv")
summary(movies)
## rank release_date title url
## Min. : 1.0 Length:500 Length:500 Length:500
## 1st Qu.:125.8 Class :character Class :character Class :character
## Median :250.5 Mode :character Mode :character Mode :character
## Mean :250.5
## 3rd Qu.:375.2
## Max. :500.0
##
## production_cost domestic_gross worldwide_gross
## Min. : 91000000 Min. : 0 Min. :0.000e+00
## 1st Qu.:110000000 1st Qu.: 70471103 1st Qu.:2.122e+08
## Median :140000000 Median :131846962 Median :3.671e+08
## Mean :149495400 Mean :169611380 Mean :4.698e+08
## 3rd Qu.:175000000 3rd Qu.:218599795 3rd Qu.:6.484e+08
## Max. :400000000 Max. :936662225 Max. :2.910e+09
##
## opening_weekend mpaa genre theaters
## Min. : 48558 Length:500 Length:500 Min. : 30
## 1st Qu.: 24218732 Class :character Class :character 1st Qu.:3378
## Median : 41671198 Mode :character Mode :character Median :3735
## Mean : 54292057 Mean :3660
## 3rd Qu.: 68123912 3rd Qu.:4065
## Max. :357115007 Max. :4802
## NA's :21 NA's :21
## runtime year
## Min. : 76.0 Min. :1991
## 1st Qu.:104.5 1st Qu.:2007
## Median :120.0 Median :2012
## Mean :121.9 Mean :2011
## 3rd Qu.:135.0 3rd Qu.:2016
## Max. :210.0 Max. :2023
## NA's :13 NA's :1
movies.lm <- lm(production_cost ~ worldwide_gross, data = movies)
summary(movies.lm)
##
## Call:
## lm(formula = production_cost ~ worldwide_gross, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -98836642 -27860863 -7302415 22465456 191127614
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.182e+08 2.845e+06 41.55 <2e-16 ***
## worldwide_gross 6.663e-02 4.685e-03 14.22 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40300000 on 498 degrees of freedom
## Multiple R-squared: 0.2889, Adjusted R-squared: 0.2874
## F-statistic: 202.3 on 1 and 498 DF, p-value: < 2.2e-16
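To make the fitted coefficients concrete, the estimates can be plugged into the equation of the line by hand. The sketch below copies the intercept and slope from the printed summary (rounded as shown there); the worldwide gross of $500 million is a hypothetical input chosen only for illustration.

```r
# Coefficients copied from the summary(movies.lm) output (rounded as printed)
intercept <- 1.182e8   # estimated production cost at zero worldwide gross
slope     <- 6.663e-2  # change in production cost per extra dollar of worldwide gross

# Hypothetical worldwide gross of $500 million, for illustration only
gross <- 5e8

# Fitted line: production_cost = intercept + slope * worldwide_gross
predicted_cost <- intercept + slope * gross
predicted_cost
```

This comes to roughly $151.5 million. With the fitted object in the session, `predict(movies.lm, newdata = data.frame(worldwide_gross = 5e8))` should give a closely matching value, differing only by the rounding of the printed coefficients.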
plot(production_cost ~ worldwide_gross, data=movies)
abline(movies.lm)

qqnorm(resid(movies.lm))
qqline(resid(movies.lm))

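The residuals seem to deviate too much from the reference line in the
normal Q-Q plot for this to be an accurate model. The multiple R-squared
of 0.2889 also indicates that worldwide gross explains only about 29% of
the variation in production cost.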
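The visual impression from the Q-Q plot can be backed up with a couple of numbers. The sketch below defines a small helper, `check_residuals()`, which is a hypothetical name and not part of the original analysis. It returns the R-squared of a fitted model and the p-value of a Shapiro-Wilk normality test on its residuals. So that the block runs on its own, it is demonstrated on R's built-in `cars` data as a stand-in for the movie data.

```r
# Hypothetical helper (not from the original analysis): summarise how well
# a fitted linear model meets the normal-residuals assumption.
check_residuals <- function(fit) {
  res <- resid(fit)
  list(
    r_squared = summary(fit)$r.squared,      # share of variance explained
    shapiro_p = shapiro.test(res)$p.value    # small p-value suggests non-normal residuals
  )
}

# Stand-in demonstration on the built-in cars data set;
# in this report the call would be check_residuals(movies.lm)
demo_fit <- lm(dist ~ speed, data = cars)
demo_check <- check_residuals(demo_fit)
demo_check
```

A small Shapiro-Wilk p-value for `movies.lm` would agree with the Q-Q plot. The test is only one possible check, and with 500 observations it can flag even modest departures from normality, so it is best read alongside the plot rather than instead of it.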