── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggfortify)
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
Rows: 25390 Columns: 44
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (25): IMDb_ID, Movie_Name, Original_Title, Studio, All_Production_Compa...
dbl (12): Movie_ID, TMDB_ID, Release_Year, Movie_Length_Minutes, TMDB_Ratin...
lgl (6): Is_Released, Is_TV_Compilation, Hidden_Gem, Is_Adult_Content, Run...
date (1): Release_Date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(anim)
# A tibble: 6 × 44
Movie_ID TMDB_ID IMDb_ID Movie_Name Original_Title Release_Year Release_Date
<dbl> <dbl> <chr> <chr> <chr> <dbl> <date>
1 1 922079 <NA> La Nageuse La Nageuse 1878 1878-05-07
2 2 922018 <NA> Le Fumeur Le Fumeur 1878 1878-05-07
3 3 922081 <NA> Le Steepl… Le Steeple-ch… 1878 1878-05-07
4 4 922011 <NA> Le Trapèze Le Trapèze 1878 1878-05-07
5 5 922184 tt271192… Les Chien… Les Chiens Sa… 1878 1878-05-07
6 6 921938 <NA> The Aquar… L'Aquarium 1878 1878-05-07
# ℹ 37 more variables: Studio <chr>, All_Production_Companies <chr>,
# Director <chr>, Country_Origin <chr>, Original_Language <chr>,
# Spoken_Languages <chr>, Animation_Style <chr>, Genre <chr>, Theme <chr>,
# Overview <chr>, Movie_Length_Minutes <dbl>, TMDB_Rating <dbl>,
# TMDB_Vote_Count <dbl>, TMDB_Popularity <dbl>, Budget_Million_USD <dbl>,
# Box_Office_Million_USD <dbl>, MPAA_Rating <chr>, Target_Audience <chr>,
# Voice_Cast <chr>, Live_Action_Remake <chr>, Belongs_To_Collection <chr>, …
Call:
lm(formula = Box_Office_Million_USD ~ Budget_Million_USD + TMDB_Popularity +
Movie_Length_Minutes + TMDB_Rating, data = small_anim)
Residuals:
Min 1Q Median 3Q Max
-1045.11 -74.34 -14.06 35.91 1817.61
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -319.1529 74.5712 -4.280 2.17e-05 ***
Budget_Million_USD 3.0770 0.1617 19.030 < 2e-16 ***
TMDB_Popularity 4.3485 0.7883 5.516 5.09e-08 ***
Movie_Length_Minutes 1.2657 0.6307 2.007 0.04522 *
TMDB_Rating 28.2402 9.7612 2.893 0.00395 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 199.7 on 619 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.487, Adjusted R-squared: 0.4837
F-statistic: 146.9 on 4 and 619 DF, p-value: < 2.2e-16
p <- ggplot(small_anim, aes(x = Budget_Million_USD, y = Box_Office_Million_USD, color = Animation_Style)) +
  labs(
    title = "Movie Budget vs. Box Office Revenue",
    caption = "Source: TMDB",
    x = "Budget",
    y = "Box Office",
    color = "Animation Style"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom") +
  scale_color_brewer(palette = "Set1") +
  geom_smooth(method = "lm", se = FALSE)

p + geom_point(alpha = 0.5, size = 1)
`geom_smooth()` using formula = 'y ~ x'
At the end of your document, write a second brief essay (incorporated directly into your Markdown file). The essay should describe:
a. How you cleaned the dataset up (be detailed and specific, using proper terminology where appropriate).
I started by narrowing the full dataset of 44 variables down to a smaller one called small_anim so it would be easier to see what I was working with. I focused on variables such as budget, box office revenue, and animation style. I then filtered out any movies with NA values for the title, budget, or box office. I also noticed that some movies had 0 dollars for their budget and box office. The NA filter did not catch these, because a zero is a recorded value rather than NA, even though here it effectively stands for missing data. So I decided to keep only movies with a budget and box office over 0.5 million dollars. This threshold also removed small-studio movies that would not give much insight into the overall picture.
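The cleaning steps described above could be sketched roughly as follows. This is a minimal illustration, not the exact code used for the report: `anim_toy` is a hypothetical stand-in for the full data with invented values, the animation style labels other than "3D CGI" are made up, and the selected columns are an assumption based on the variables named in the essay.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for the full dataset (values are invented).
anim_toy <- tibble(
  Movie_Name             = c("A", "B", NA, "D", "E"),
  Budget_Million_USD     = c(50, 0, 20, NA, 0.2),
  Box_Office_Million_USD = c(120, 0, 35, 10, 0.1),
  Animation_Style        = c("3D CGI", "Style X", "Style Y", "3D CGI", "Style X")
)

small_toy <- anim_toy %>%
  # Keep only the variables of interest.
  select(Movie_Name, Budget_Million_USD, Box_Office_Million_USD, Animation_Style) %>%
  filter(
    # Drop rows with missing title, budget, or box office.
    !is.na(Movie_Name),
    !is.na(Budget_Million_USD),
    !is.na(Box_Office_Million_USD),
    # Zeros are not NA, so a threshold is needed to remove them.
    Budget_Million_USD > 0.5,
    Box_Office_Million_USD > 0.5
  )

small_toy
```

In this toy example only movie "A" survives: "B" has zeros, the third row has no title, "D" has a missing budget, and "E" falls below the 0.5 million threshold.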
b. What the visualization represents, any interesting patterns or surprises that arise within the visualization.
The scatter plot shows a clear positive relationship between budget and box office revenue: the higher the budget, the higher the box office usually is. The trend lines, drawn separately for each animation style, make that relationship easier to see. The graph also shows that 3D CGI movies tend to make more than other styles. Very few movies in other styles reach a comparably high box office revenue, which is an interesting pattern to see.
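One way to check the impression that 3D CGI movies tend to earn more is a grouped summary by animation style. The sketch below uses a hypothetical toy data frame with invented values and made-up style labels (apart from "3D CGI"); with the real data, `small_anim` would take the place of `toy`.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data (values invented for illustration only).
toy <- tibble(
  Animation_Style        = c("3D CGI", "3D CGI", "Style X", "Style X", "Style Y"),
  Box_Office_Million_USD = c(300, 500, 80, 120, 60)
)

style_summary <- toy %>%
  group_by(Animation_Style) %>%
  summarise(
    n_movies          = n(),                            # movies per style
    median_box_office = median(Box_Office_Million_USD), # typical revenue
    .groups = "drop"
  ) %>%
  arrange(desc(median_box_office))

style_summary
```

The median is used here rather than the mean because box office figures tend to be skewed by a few very large hits.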
c. Anything that you might have shown that you could not get to work or that you wished you could have included.
I wanted to label each dot with the studio that produced the movie, to see which studio usually earns the most or keeps earning the most over time. Adding the release year could also help. However, all of that would make the scatter plot very cluttered and hard to read.
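One possible way around the clutter is to move the studio and release year into hover tooltips instead of printing them on the plot. The sketch below assumes the plotly package (already loaded in this document) and uses a hypothetical toy data frame with invented values; the `text` aesthetic is only used to build the tooltip.

```r
library(ggplot2)
library(plotly)

# Hypothetical toy data (values and names invented for illustration only).
toy <- data.frame(
  Movie_Name             = c("A", "B", "C"),
  Studio                 = c("Studio 1", "Studio 2", "Studio 1"),
  Release_Year           = c(2001, 2010, 2019),
  Budget_Million_USD     = c(50, 90, 150),
  Box_Office_Million_USD = c(120, 200, 600),
  Animation_Style        = c("3D CGI", "Style X", "3D CGI")
)

p_hover <- ggplot(
  toy,
  aes(
    x = Budget_Million_USD,
    y = Box_Office_Million_USD,
    color = Animation_Style,
    # Tooltip text shown on hover; not drawn on the static plot.
    text = paste0(Movie_Name, "<br>Studio: ", Studio, "<br>Year: ", Release_Year)
  )
) +
  geom_point(alpha = 0.5) +
  theme_minimal()

# Convert to an interactive plot and show only the custom tooltip text.
fig <- ggplotly(p_hover, tooltip = "text")
fig
```

This keeps the scatter plot itself uncluttered, since the extra details only appear when the reader hovers over a point.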