Project 1

Author

Ryan Juica

Introduction

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggfortify)
library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

setwd("C:/Users/ryanj/Downloads/DATA110")
anim <- read_csv('animation_movies_enriched_1878_2029.csv')

Rows: 25390 Columns: 44
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (25): IMDb_ID, Movie_Name, Original_Title, Studio, All_Production_Compa...
dbl  (12): Movie_ID, TMDB_ID, Release_Year, Movie_Length_Minutes, TMDB_Ratin...
lgl   (6): Is_Released, Is_TV_Compilation, Hidden_Gem, Is_Adult_Content, Run...
date  (1): Release_Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(anim)

# A tibble: 6 × 44
  Movie_ID TMDB_ID IMDb_ID   Movie_Name Original_Title Release_Year Release_Date
     <dbl>   <dbl> <chr>     <chr>      <chr>                 <dbl> <date>      
1        1  922079 <NA>      La Nageuse La Nageuse             1878 1878-05-07  
2        2  922018 <NA>      Le Fumeur  Le Fumeur              1878 1878-05-07  
3        3  922081 <NA>      Le Steepl… Le Steeple-ch…         1878 1878-05-07  
4        4  922011 <NA>      Le Trapèze Le Trapèze             1878 1878-05-07  
5        5  922184 tt271192… Les Chien… Les Chiens Sa…         1878 1878-05-07  
6        6  921938 <NA>      The Aquar… L'Aquarium             1878 1878-05-07  
# ℹ 37 more variables: Studio <chr>, All_Production_Companies <chr>,
#   Director <chr>, Country_Origin <chr>, Original_Language <chr>,
#   Spoken_Languages <chr>, Animation_Style <chr>, Genre <chr>, Theme <chr>,
#   Overview <chr>, Movie_Length_Minutes <dbl>, TMDB_Rating <dbl>,
#   TMDB_Vote_Count <dbl>, TMDB_Popularity <dbl>, Budget_Million_USD <dbl>,
#   Box_Office_Million_USD <dbl>, MPAA_Rating <chr>, Target_Audience <chr>,
#   Voice_Cast <chr>, Live_Action_Remake <chr>, Belongs_To_Collection <chr>, …

Cleaning

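# Keep movies with a known title, animation style, budget, and box office.
# The 0.5 threshold also drops zero placeholders and very small releases.
cleaned_anim <- anim |>
  filter(
    !is.na(Budget_Million_USD),
    !is.na(Box_Office_Million_USD),
    !is.na(Movie_Name),
    !is.na(Animation_Style),
    Budget_Million_USD > 0.5,
    Box_Office_Million_USD > 0.5
  )

# Keep only the columns used in the model and the plot
small_anim <- cleaned_anim |>
  select(
    Movie_Name, Budget_Million_USD, Box_Office_Million_USD, Animation_Style,
    TMDB_Popularity, Movie_Length_Minutes, TMDB_Rating
  )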

head(small_anim)

# A tibble: 6 × 7
  Movie_Name           Budget_Million_USD Box_Office_Million_USD Animation_Style
  <chr>                             <dbl>                  <dbl> <chr>          
1 Snow White and the …               1.49                 185.   2D Traditional 
2 Fantasia                           2.28                  76.4  2D Traditional 
3 Pinocchio                          2.6                  164    2D Traditional 
4 Dumbo                              0.81                   1.6  2D Traditional 
5 Bambi                              0.86                 267.   2D Traditional 
6 Victory Through Air…               0.8                    0.79 2D Traditional 
# ℹ 3 more variables: TMDB_Popularity <dbl>, Movie_Length_Minutes <dbl>,
#   TMDB_Rating <dbl>

#summary(small_anim)

model <- lm(Box_Office_Million_USD ~ Budget_Million_USD + TMDB_Popularity + Movie_Length_Minutes + TMDB_Rating, data = small_anim)

summary(model)


Call:
lm(formula = Box_Office_Million_USD ~ Budget_Million_USD + TMDB_Popularity + 
    Movie_Length_Minutes + TMDB_Rating, data = small_anim)

Residuals:
     Min       1Q   Median       3Q      Max 
-1045.11   -74.34   -14.06    35.91  1817.61 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          -319.1529    74.5712  -4.280 2.17e-05 ***
Budget_Million_USD      3.0770     0.1617  19.030  < 2e-16 ***
TMDB_Popularity         4.3485     0.7883   5.516 5.09e-08 ***
Movie_Length_Minutes    1.2657     0.6307   2.007  0.04522 *  
TMDB_Rating            28.2402     9.7612   2.893  0.00395 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 199.7 on 619 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.487, Adjusted R-squared:  0.4837 
F-statistic: 146.9 on 4 and 619 DF,  p-value: < 2.2e-16
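
The residuals in the summary above run from about -1045 to 1818 million USD, far wider than their interquartile range (about -74 to 36), so a residuals-versus-fitted plot may be worth a look. The sketch below is only an illustration: it uses made-up toy data (`toy`, `toy_model`, and `diag_df` are hypothetical names, not objects from this project) and a one-predictor model rather than the full model. ggfortify, which is loaded above, also provides `autoplot()` for `lm` objects as a shorter alternative.

```r
library(ggplot2)

# Toy data standing in for small_anim (values are simulated, not real movies)
set.seed(110)
toy <- data.frame(Budget_Million_USD = runif(200, 1, 200))
# Noise grows with budget so the plot shows a visible funnel shape
toy$Box_Office_Million_USD <- 3 * toy$Budget_Million_USD +
  rnorm(200, sd = 0.5 * toy$Budget_Million_USD)

toy_model <- lm(Box_Office_Million_USD ~ Budget_Million_USD, data = toy)

# Collect fitted values and residuals for plotting
diag_df <- data.frame(
  fitted = fitted(toy_model),
  resid  = residuals(toy_model)
)

# Residuals vs. fitted: a funnel shape suggests non-constant variance
resid_plot <- ggplot(diag_df, aes(x = fitted, y = resid)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted box office (million USD)", y = "Residual (million USD)") +
  theme_minimal()

resid_plot
```

If the real model shows a similar funnel, a log transform of budget and box office might be one thing to try.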

# Scatter plot of budget vs. box office, with one linear trend line per animation style
p <- ggplot(small_anim, aes(x = Budget_Million_USD, y = Box_Office_Million_USD, color = Animation_Style)) +
  labs(
    title = "Movie Budget vs. Box Office Revenue",
    caption = "Source: TMDB",
    x = "Budget (million USD)",
    y = "Box Office (million USD)",
    color = "Animation Style"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom") +
  scale_color_brewer(palette = "Set1") +
  geom_smooth(method = "lm", se = FALSE)

p + geom_point(alpha = 0.5, size = 1)

`geom_smooth()` using formula = 'y ~ x'

At the end of your document, write a second brief essay (incorporated directly into your Markdown file). The essay should describe:

a. How you cleaned the dataset up (be detailed and specific, using proper terminology where appropriate).

I started by filtering the full dataset, which has 25,390 rows and 44 variables. I removed every movie with an NA in its title, animation style, budget, or box office revenue. I also noticed that some movies listed 0 dollars for their budget and box office. The NA filter did not catch these, because a recorded 0 is not an NA even though it effectively marks a missing value. So I kept only movies whose budget and box office were both over 0.5 million dollars, which also removed very small productions that would not give much insight into the overall picture. I then used select() to narrow the data down to seven variables (movie name, budget, box office revenue, animation style, TMDB popularity, movie length, and TMDB rating) in a smaller dataset called small_anim, so it would be easier to see what I was working with.
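
One alternative way to handle the zero placeholders described above is to recode them as NA explicitly, so that a single missing-value step catches both cases. This is a minimal sketch on made-up rows (`toy` and `toy_clean` are hypothetical names). Unlike the 0.5 million dollar threshold used in this project, it does not remove small but nonzero budgets.

```r
library(dplyr)
library(tidyr)

# Toy rows (made-up values) with zero placeholders and a true NA
toy <- tibble(
  Movie_Name             = c("A", "B", "C", "D"),
  Budget_Million_USD     = c(12.0, 0, NA, 80.0),
  Box_Office_Million_USD = c(30.5, 4.2, 10.0, 0)
)

toy_clean <- toy |>
  # Recode 0 as NA so zero placeholders count as missing
  mutate(across(
    c(Budget_Million_USD, Box_Office_Million_USD),
    ~ na_if(.x, 0)
  )) |>
  # Drop rows that are missing either money column
  drop_na(Budget_Million_USD, Box_Office_Million_USD)

toy_clean
```

Only movie "A" survives here, because every other row has a zero or an NA in one of the two money columns.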

b. What the visualization represents, any interesting patterns or surprises that arise within the visualization.

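The scatter plot shows a strong positive relationship between budget and box office revenue: the higher the budget, the higher the box office usually is. The linear trend lines, one per animation style, make that relationship easier to see. The regression model above points the same way, estimating about 3.08 million dollars of additional box office for each additional million dollars of budget when popularity, length, and rating are held constant. The graph also shows that 3D CGI movies tend to earn more than other styles. Very few movies in the other styles reach a high box office revenue, and it is interesting to see that pattern.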
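
The comparison between animation styles could also be checked numerically with a grouped summary. The sketch below uses made-up values (`toy` and `style_summary` are hypothetical names, and the "Stop Motion" label is assumed for illustration). With the real data, `small_anim` would take the place of `toy`.

```r
library(dplyr)

# Toy data (made-up values) with the same column names as small_anim
toy <- tibble(
  Animation_Style = c("3D CGI", "3D CGI", "3D CGI",
                      "2D Traditional", "2D Traditional", "Stop Motion"),
  Box_Office_Million_USD = c(500, 300, 100, 80, 120, 60)
)

# Count movies and take the median box office within each style
style_summary <- toy |>
  group_by(Animation_Style) |>
  summarise(
    n_movies = n(),
    median_box_office = median(Box_Office_Million_USD),
    .groups = "drop"
  ) |>
  arrange(desc(median_box_office))

style_summary
```

The median is used here instead of the mean because a few very large hits can pull the mean upward.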

c. Anything that you might have shown that you could not get to work or that you wished you could have included.

I wanted to show which studio each movie (each dot) came from, to see which studios usually earn the most or keep earning the most over time. Adding the release year could also help. However, putting all of that on one static scatter plot would make it very cluttered and hard to read.
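
One possible way around the clutter is to put the studio and release year in a hover tooltip instead of on the plot itself, using plotly, which is already loaded above. This is a minimal sketch on made-up rows (`toy`, `p_hover`, and `fig` are hypothetical names, and the studio names are invented). It assumes the Studio and Release_Year columns would be kept in the selected data.

```r
library(ggplot2)
library(plotly)

# Toy rows (made-up values) using column names from the full dataset
toy <- data.frame(
  Movie_Name = c("Movie A", "Movie B", "Movie C"),
  Studio = c("Studio X", "Studio Y", "Studio X"),
  Release_Year = c(1995, 2010, 2021),
  Budget_Million_USD = c(30, 150, 90),
  Box_Office_Million_USD = c(360, 520, 210)
)

# The text aesthetic is not drawn by ggplot2 (it may warn about an unknown
# aesthetic); ggplotly() picks it up for the tooltip
p_hover <- ggplot(
  toy,
  aes(
    x = Budget_Million_USD,
    y = Box_Office_Million_USD,
    text = paste0(Movie_Name, "<br>Studio: ", Studio, "<br>Year: ", Release_Year)
  )
) +
  geom_point(alpha = 0.5) +
  labs(x = "Budget (million USD)", y = "Box Office (million USD)") +
  theme_minimal()

# Show only the custom text when hovering over a dot
fig <- ggplotly(p_hover, tooltip = "text")
fig
```

Hovering over a dot then shows the movie, studio, and year without adding any extra marks to the chart.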