A linear model of the production cost of the top 500 movies as a function of their worldwide gross sales.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.0      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
movies <- read.csv("https://raw.githubusercontent.com/johnnyboy1287/hw11Movie/main/top-500-movies.csv")


summary(movies)
##       rank       release_date          title               url           
##  Min.   :  1.0   Length:500         Length:500         Length:500        
##  1st Qu.:125.8   Class :character   Class :character   Class :character  
##  Median :250.5   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :250.5                                                           
##  3rd Qu.:375.2                                                           
##  Max.   :500.0                                                           
##                                                                          
##  production_cost     domestic_gross      worldwide_gross    
##  Min.   : 91000000   Min.   :        0   Min.   :0.000e+00  
##  1st Qu.:110000000   1st Qu.: 70471103   1st Qu.:2.122e+08  
##  Median :140000000   Median :131846962   Median :3.671e+08  
##  Mean   :149495400   Mean   :169611380   Mean   :4.698e+08  
##  3rd Qu.:175000000   3rd Qu.:218599795   3rd Qu.:6.484e+08  
##  Max.   :400000000   Max.   :936662225   Max.   :2.910e+09  
##                                                             
##  opening_weekend         mpaa              genre              theaters   
##  Min.   :    48558   Length:500         Length:500         Min.   :  30  
##  1st Qu.: 24218732   Class :character   Class :character   1st Qu.:3378  
##  Median : 41671198   Mode  :character   Mode  :character   Median :3735  
##  Mean   : 54292057                                         Mean   :3660  
##  3rd Qu.: 68123912                                         3rd Qu.:4065  
##  Max.   :357115007                                         Max.   :4802  
##  NA's   :21                                                NA's   :21    
##     runtime           year     
##  Min.   : 76.0   Min.   :1991  
##  1st Qu.:104.5   1st Qu.:2007  
##  Median :120.0   Median :2012  
##  Mean   :121.9   Mean   :2011  
##  3rd Qu.:135.0   3rd Qu.:2016  
##  Max.   :210.0   Max.   :2023  
##  NA's   :13      NA's   :1
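
The summary reports a minimum of 0 for both gross columns and missing values in `opening_weekend`, `theaters`, `runtime`, and `year`. A minimal sketch for counting such entries per numeric column is given here. The helper name `count_na_zero` and the data frame `toy` are assumptions made for illustration; `toy` holds made-up values, not rows from the CSV.

```r
# Count missing values and exact zeros in every numeric column.
# `count_na_zero` is a hypothetical helper, not part of any package.
count_na_zero <- function(df) {
  num <- df[vapply(df, is.numeric, logical(1))]
  data.frame(
    column = names(num),
    n_na   = unname(vapply(num, function(x) sum(is.na(x)), integer(1))),
    n_zero = unname(vapply(num, function(x) sum(x == 0, na.rm = TRUE), integer(1)))
  )
}

# Made-up toy data, only to show the call; these are not rows from the CSV.
toy <- data.frame(
  title           = c("a", "b", "c"),
  worldwide_gross = c(0, 2.5e8, 4.0e8),
  theaters        = c(NA, 3500, 4000)
)

checks <- count_na_zero(toy)
print(checks)
```

In the session above, the equivalent call would be `count_na_zero(movies)`.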
movies.lm <- lm(production_cost ~ worldwide_gross, data = movies)

summary(movies.lm)
## 
## Call:
## lm(formula = production_cost ~ worldwide_gross, data = movies)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -98836642 -27860863  -7302415  22465456 191127614 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.182e+08  2.845e+06   41.55   <2e-16 ***
## worldwide_gross 6.663e-02  4.685e-03   14.22   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40300000 on 498 degrees of freedom
## Multiple R-squared:  0.2889, Adjusted R-squared:  0.2874 
## F-statistic: 202.3 on 1 and 498 DF,  p-value: < 2.2e-16
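
To make the coefficients concrete, the fitted line can be evaluated by hand. The sketch below copies the intercept and slope from the `summary(movies.lm)` output and the median worldwide gross from `summary(movies)`; the variable names are chosen here for illustration. A slope of 6.663e-02 means each additional dollar of worldwide gross is associated with roughly 6.7 cents more production cost.

```r
# Coefficients copied from the summary(movies.lm) output (rounded as printed).
intercept <- 1.182e+08
slope     <- 6.663e-02

# Median worldwide gross, copied from summary(movies) (rounded as printed).
median_gross <- 3.671e+08

# Fitted production cost at the median worldwide gross:
# intercept + slope * gross
predicted_cost <- intercept + slope * median_gross
print(predicted_cost)
```

The result is about 1.43e+08, close to the median production cost of 1.40e+08 reported in the summary. With the fitted model in the session, `predict(movies.lm, newdata = data.frame(worldwide_gross = 3.671e8))` should give a similar value, up to rounding of the printed coefficients.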
plot(production_cost ~ worldwide_gross, data=movies)

abline(movies.lm)

qqnorm(resid(movies.lm))
qqline(resid(movies.lm))

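There is too much deviation from the line for this to be an accurate model. The R-squared of 0.2889 means that worldwide gross explains only about 29% of the variation in production cost, and the residuals are skewed: the median is about -7.3 million, while the maximum (about 191 million) is roughly twice as far from zero as the minimum (about -98.8 million).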
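
A numeric check can complement the visual reading of the Q-Q plot. The sketch below is one possible approach, not part of the original analysis: it pairs the model's R-squared with a Shapiro-Wilk normality test on the residuals. The helper name `fit_diagnostics` is hypothetical, and the data are simulated with deliberately right-skewed noise so that the block runs on its own.

```r
# Collect R-squared and the Shapiro-Wilk p-value for the residuals of a fit.
# `fit_diagnostics` is a hypothetical helper, not part of any package.
fit_diagnostics <- function(fit) {
  res <- resid(fit)
  list(
    r_squared = summary(fit)$r.squared,
    shapiro_p = shapiro.test(res)$p.value  # small p-value: residuals look non-normal
  )
}

# Simulated stand-in data (an assumption for illustration, not the movie data).
set.seed(1)
x <- runif(200, 0, 10)
y <- 2 + 0.5 * x + rexp(200)  # exponential noise is right-skewed

toy_fit  <- lm(y ~ x)
toy_diag <- fit_diagnostics(toy_fit)
print(toy_diag)
```

In the session above, the equivalent call would be `fit_diagnostics(movies.lm)`. A small Shapiro-Wilk p-value would support the impression from the Q-Q plot that the residuals are not normally distributed.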