Data Dive Week 4

Setting up R and Loading the Data Set

First we load the libraries we will use, then we read in the data set we downloaded.

#Load in Libraries
library(tidyr)
library(readr)
library(dplyr)
library(lubridate)
library(stringr)
library(janitor)
library(ggplot2)
library(scales)

#Load in the dataset
movies_raw <- read_csv("/Users/jus10segrest/Downloads/iu indy/stat for data science/movies.csv")

Next we clean and reformat the data set so that it is easier to work with.

# Create a new table, splitting the released column into a release date and a release country
movies_ <- movies_raw |>
  separate(released, into = c("release_new", "country_released"), sep = " \\(") |>
  mutate(country_released = str_remove(country_released, "\\)$")) |>  # remove the closing parenthesis
  mutate(release_date = mdy(release_new)) |>  # parse the date text into a Date column
  rename(country_filmed = country)  # rename the column for ease of understanding
  
movies_

## # A tibble: 7,668 × 17
##    name    rating genre  year release_new country_released score  votes director
##    <chr>   <chr>  <chr> <dbl> <chr>       <chr>            <dbl>  <dbl> <chr>   
##  1 The Sh… R      Drama  1980 June 13, 1… United States      8.4 9.27e5 Stanley…
##  2 The Bl… R      Adve…  1980 July 2, 19… United States      5.8 6.5 e4 Randal …
##  3 Star W… PG     Acti…  1980 June 20, 1… United States      8.7 1.20e6 Irvin K…
##  4 Airpla… PG     Come…  1980 July 2, 19… United States      7.7 2.21e5 Jim Abr…
##  5 Caddys… R      Come…  1980 July 25, 1… United States      7.3 1.08e5 Harold …
##  6 Friday… R      Horr…  1980 May 9, 1980 United States      6.4 1.23e5 Sean S.…
##  7 The Bl… R      Acti…  1980 June 20, 1… United States      7.9 1.88e5 John La…
##  8 Raging… R      Biog…  1980 December 1… United States      8.2 3.30e5 Martin …
##  9 Superm… PG     Acti…  1980 June 19, 1… United States      6.8 1.01e5 Richard…
## 10 The Lo… R      Biog…  1980 May 16, 19… United States      7   1   e4 Walter …
## # ℹ 7,658 more rows
## # ℹ 8 more variables: writer <chr>, star <chr>, country_filmed <chr>,
## #   budget <dbl>, gross <dbl>, company <chr>, runtime <dbl>,
## #   release_date <date>
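
To see what each cleaning step does, here is a minimal sketch on three made-up rows. The film names and values in `toy_raw` are hypothetical and are not taken from movies.csv; the third row deliberately has date text that cannot be parsed. The `quiet = TRUE` argument only suppresses the parse warning so that the failures can be counted instead.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(lubridate)

# Hypothetical rows written in the same "Month day, year (Country)" layout
toy_raw <- tibble(
  name = c("Film A", "Film B", "Film C"),
  released = c(
    "June 13, 1980 (United States)",
    "May 9, 1980 (France)",
    "unknown (Japan)"
  )
)

toy_clean <- toy_raw |>
  # split at " (" so the date text and the country land in separate columns
  separate(released, into = c("release_new", "country_released"), sep = " \\(") |>
  mutate(
    country_released = str_remove(country_released, "\\)$"),  # drop the trailing ")"
    release_date = mdy(release_new, quiet = TRUE)             # unparseable text becomes NA
  )

# Count the rows whose date text could not be parsed
n_unparsed <- sum(is.na(toy_clean$release_date))
```

Running the same `sum(is.na(...))` check on the real `release_date` column would be one way to find out whether any release dates in the full data set failed to parse.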

Creating the Samples

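We need to create 5 random samples of our data. Each sample contains half as many rows as the full data set (3,834 of 7,668) and is drawn with replacement (replace = TRUE), so the same movie can appear more than once in a sample. Because no seed is set, rerunning the code will draw different rows than the ones printed below.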

sample_1 <- sample_n(movies_,(7668/2),replace = TRUE)
print(sample_1)

## # A tibble: 3,834 × 17
##    name    rating genre  year release_new country_released score  votes director
##    <chr>   <chr>  <chr> <dbl> <chr>       <chr>            <dbl>  <dbl> <chr>   
##  1 Ghostb… PG     Acti…  1984 June 8, 19… United States      7.8 365000 Ivan Re…
##  2 Misery  R      Drama  1990 November 3… United States      7.8 191000 Rob Rei…
##  3 Lambada PG     Drama  1990 March 16, … United States      3.5    979 Joel Si…
##  4 Happy … PG     Anim…  2011 November 1… United States      5.9  43000 George …
##  5 On Che… R      Drama  2017 May 18, 20… United States      6.3  10000 Dominic…
##  6 Gettys… PG     Drama  1993 October 8,… United States      7.6  27000 Ron Max…
##  7 Second… PG     Come…  2003 September … United States      7.5  54000 Tim McC…
##  8 K-PAX   PG-13  Drama  2001 October 26… United States      7.4 177000 Iain So…
##  9 Money   <NA>   Crime  2019 March 22, … United States      5.9    630 Noo-ri …
## 10 High S… PG-13  Come…  1988 November 1… United States      5.8   8900 Neil Jo…
## # ℹ 3,824 more rows
## # ℹ 8 more variables: writer <chr>, star <chr>, country_filmed <chr>,
## #   budget <dbl>, gross <dbl>, company <chr>, runtime <dbl>,
## #   release_date <date>

sample_2 <- sample_n(movies_,(7668/2),replace = TRUE)
print(sample_2)

## # A tibble: 3,834 × 17
##    name    rating genre  year release_new country_released score  votes director
##    <chr>   <chr>  <chr> <dbl> <chr>       <chr>            <dbl>  <dbl> <chr>   
##  1 Bad Ti… R      Crime  2018 October 12… United States      7.1 135000 Drew Go…
##  2 Popsta… R      Come…  2016 June 3, 20… United States      6.7  57000 Akiva S…
##  3 The Bo… PG-13  Drama  2013 November 2… United States      7.5 130000 Brian P…
##  4 Blue    Not R… Biog…  1993 December 3… United States      7.3   2100 Derek J…
##  5 Barely… R      Come…  2003 May 25, 20… Thailand           4.7   5900 David M…
##  6 Rust a… R      Drama  2012 May 17, 20… Belgium            7.5  65000 Jacques…
##  7 Me and… R      Come…  2005 August 5, … United States      7.3  36000 Miranda…
##  8 Old Sc… R      Come…  2003 February 2… United States      7.1 219000 Todd Ph…
##  9 Kill Y… R      Biog…  2013 September … Croatia            6.5  36000 John Kr…
## 10 She's … R      Come…  2014 August 21,… United States      6.1  25000 Peter B…
## # ℹ 3,824 more rows
## # ℹ 8 more variables: writer <chr>, star <chr>, country_filmed <chr>,
## #   budget <dbl>, gross <dbl>, company <chr>, runtime <dbl>,
## #   release_date <date>

sample_3 <- sample_n(movies_,(7668/2),replace = TRUE)
print(sample_3)

## # A tibble: 3,834 × 17
##    name    rating genre  year release_new country_released score  votes director
##    <chr>   <chr>  <chr> <dbl> <chr>       <chr>            <dbl>  <dbl> <chr>   
##  1 Dead A… R      Come…  1992 February 1… United States      7.5  93000 Peter J…
##  2 Round … <NA>   Fami…  2019 June 21, 2… United States      4.6    150 Dylan T…
##  3 One Cr… PG     Come…  1986 August 8, … United States      6.4  14000 Savage …
##  4 Consta… R      Acti…  2005 February 1… United States      7   317000 Francis…
##  5 Tigerl… R      Drama  2000 May 24, 20… Germany            7    39000 Joel Sc…
##  6 Americ… R      Crime  2013 December 2… United States      7.2 458000 David O…
##  7 Beauty… PG-13  Come…  2005 March 30, … United States      5.6  17000 Bille W…
##  8 Red Li… R      Drama  2012 March 2, 2… Spain              6.2  59000 Rodrigo…
##  9 Gulliv… PG     Adve…  2010 December 2… United States      4.9  67000 Rob Let…
## 10 The Go… PG-13  Biog…  2014 November 7… South Africa       7.4  28000 Philipp…
## # ℹ 3,824 more rows
## # ℹ 8 more variables: writer <chr>, star <chr>, country_filmed <chr>,
## #   budget <dbl>, gross <dbl>, company <chr>, runtime <dbl>,
## #   release_date <date>

sample_4 <- sample_n(movies_,(7668/2),replace = TRUE)
print(sample_4)

## # A tibble: 3,834 × 17
##    name    rating genre  year release_new country_released score  votes director
##    <chr>   <chr>  <chr> <dbl> <chr>       <chr>            <dbl>  <dbl> <chr>   
##  1 My Fat… G      Adve…  1990 August 1991 United States      7.6   6200 Yves Ro…
##  2 Ghost … TV-MA  Anim…  1995 March 29, … United States      8   133000 Mamoru …
##  3 Orgazmo NC-17  Come…  1997 October 23… United States      6.1  35000 Trey Pa…
##  4 Fever … R      Come…  1997 April 4, 1… United Kingdom     6.7  10000 David E…
##  5 Thelma  Not R… Drama  2017 September … Norway             7    28000 Joachim…
##  6 I Am L… R      Drama  2009 July 23, 2… United States      7    21000 Luca Gu…
##  7 Godzil… PG-13  Acti…  2019 May 31, 20… United States      6   162000 Michael…
##  8 Blue C… PG-13  Drama  1994 February 1… United States      6.2  13000 William…
##  9 Sarafi… PG-13  Drama  1992 September … United States      6.3   1900 Darrell…
## 10 When M… PG     Anim…  2014 August 7, … United States      7.7  35000 James S…
## # ℹ 3,824 more rows
## # ℹ 8 more variables: writer <chr>, star <chr>, country_filmed <chr>,
## #   budget <dbl>, gross <dbl>, company <chr>, runtime <dbl>,
## #   release_date <date>

sample_5 <- sample_n(movies_,(7668/2),replace = TRUE)
print(sample_5)

## # A tibble: 3,834 × 17
##    name    rating genre  year release_new country_released score  votes director
##    <chr>   <chr>  <chr> <dbl> <chr>       <chr>            <dbl>  <dbl> <chr>   
##  1 Dreams… PG-13  Acti…  1984 August 17,… United States      6.3  14000 Joseph …
##  2 Frozen  PG     Anim…  2013 November 2… United States      7.4 585000 Chris B…
##  3 Whose … R      Come…  1981 January 22… United States      7.3   2700 John Ba…
##  4 The Am… PG-13  Acti…  2014 May 2, 2014 United States      6.6 427000 Marc We…
##  5 Charlo… G      Adve…  2006 December 1… United States      6.3  39000 Gary Wi…
##  6 The Mo… R      Acti…  1999 March 26, … United States      4.3   8800 Scott S…
##  7 Heaven… R      Drama  1984 February 1… United States      5.2    733 Lawrenc…
##  8 Scoop   PG-13  Come…  2006 July 28, 2… United States      6.6  82000 Woody A…
##  9 The Mi… R      Drama  2000 February 1… Germany            5.9  21000 Wim Wen…
## 10 What a… PG     Come…  2003 April 4, 2… United States      5.8  61000 Dennie …
## # ℹ 3,824 more rows
## # ℹ 8 more variables: writer <chr>, star <chr>, country_filmed <chr>,
## #   budget <dbl>, gross <dbl>, company <chr>, runtime <dbl>,
## #   release_date <date>
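
The five draws above could also be made reproducible and written without repetition. The following is a sketch under stated assumptions, not the method used in this report: `toy_movies` is a small stand-in table so the example runs on its own, the seed value 42 is arbitrary, and `slice_sample()` is swapped in for `sample_n()`, which dplyr now marks as superseded.

```r
library(dplyr)

set.seed(42)  # assumption: any fixed seed works; it only makes the draws repeatable

# Stand-in for movies_ so the example is self-contained
toy_movies <- tibble(
  id = 1:20,
  gross = seq(1e6, 20e6, by = 1e6)
)

# Draw 5 samples, each half the size of the data, with replacement
samples <- lapply(1:5, function(i) {
  slice_sample(toy_movies, n = nrow(toy_movies) %/% 2, replace = TRUE)
})
names(samples) <- paste0("sample_", 1:5)
```

Using `nrow(toy_movies) %/% 2` instead of a hard-coded 7668/2 keeps the sample size correct if the number of rows in the data ever changes.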

We could scan the printed rows directly, but it is quicker and easier to compare the samples through summary statistics.

Analyzing the Data

Here we summarize each sample to get its mean and median gross revenue along with its mean IMDb score.

sample_1 |>
  summarize(
    mean_gross = mean(gross, na.rm=TRUE),
    median_gross = median(gross, na.rm=TRUE),
    mean_score = mean(score, na.rm=TRUE)
  )

## # A tibble: 1 × 3
##   mean_gross median_gross mean_score
##        <dbl>        <dbl>      <dbl>
## 1  80947412.     19905359       6.37

sample_2 |>
  summarize(
    mean_gross = mean(gross, na.rm=TRUE),
    median_gross = median(gross, na.rm=TRUE),
    mean_score = mean(score, na.rm=TRUE)
  )

## # A tibble: 1 × 3
##   mean_gross median_gross mean_score
##        <dbl>        <dbl>      <dbl>
## 1  78548991.    19579034.       6.40

sample_3 |>
  summarize(
    mean_gross = mean(gross, na.rm=TRUE),
    median_gross = median(gross, na.rm=TRUE),
    mean_score = mean(score, na.rm=TRUE)
  )

## # A tibble: 1 × 3
##   mean_gross median_gross mean_score
##        <dbl>        <dbl>      <dbl>
## 1  79065400.     20016254       6.39

sample_4 |>
  summarize(
    mean_gross = mean(gross, na.rm=TRUE),
    median_gross = median(gross, na.rm=TRUE),
    mean_score = mean(score, na.rm=TRUE)
  )

## # A tibble: 1 × 3
##   mean_gross median_gross mean_score
##        <dbl>        <dbl>      <dbl>
## 1  79206218.    21245760.       6.40

sample_5 |>
  summarize(
    mean_gross = mean(gross, na.rm=TRUE),
    median_gross = median(gross, na.rm=TRUE),
    mean_score = mean(score, na.rm=TRUE)
  )

## # A tibble: 1 × 3
##   mean_gross median_gross mean_score
##        <dbl>        <dbl>      <dbl>
## 1  79614192.     21460601       6.39
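
The five summaries above can also be produced as a single table, which makes the samples easier to compare side by side. This is a minimal sketch on made-up data: `toy_movies`, `toy_samples`, and `sample_summaries` are hypothetical names, and the values are not from movies.csv.

```r
library(dplyr)

set.seed(1)  # arbitrary seed so the toy draws are repeatable

# Stand-in data with a few missing values, like the real gross and score columns
toy_movies <- tibble(
  gross = c(1e6, 5e6, NA, 2e7, 8e7, 3e6, 4e7, 9e6),
  score = c(6.1, 7.2, 5.5, NA, 8.0, 6.4, 7.1, 5.9)
)

toy_samples <- lapply(1:5, function(i) {
  slice_sample(toy_movies, n = 4, replace = TRUE)
})
names(toy_samples) <- paste0("sample_", 1:5)

# Stack the samples, label each row with its sample name, then summarize per sample
sample_summaries <- bind_rows(toy_samples, .id = "sample") |>
  group_by(sample) |>
  summarize(
    mean_gross = mean(gross, na.rm = TRUE),
    median_gross = median(gross, na.rm = TRUE),
    mean_score = mean(score, na.rm = TRUE)
  )
```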

Looking through the summaries, there are clear similarities between the samples. The mean score falls between 6.37 and 6.40 in every sample, so no matter which sample we draw, movies in the data set have an average IMDb score of about 6.4. The mean gross tells the same story: it ranges from about $78.5 million to $80.9 million, a spread of roughly 3%. The median gross fluctuates more, from about $19.6 million to $21.5 million (a spread of just under 10%), but it stays close to $20 million in every sample, so the samples are still very similar.
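
How closely the samples agree can be checked directly from the printed summaries. The sketch below copies the mean and median gross values from the output above (rounded as printed) and computes the gap between the lowest and highest sample as a percentage of the lowest. The vector names and the helper `relative_spread` are illustrative and not part of the original analysis.

```r
# Values copied from the five summary tables above (rounded as printed)
mean_gross   <- c(80947412, 78548991, 79065400, 79206218, 79614192)
median_gross <- c(19905359, 19579034, 20016254, 21245760, 21460601)
mean_score   <- c(6.37, 6.40, 6.39, 6.40, 6.39)

# Gap between the largest and smallest value, as a percentage of the smallest
relative_spread <- function(x) {
  (max(x) - min(x)) / min(x) * 100
}

spread_mean   <- relative_spread(mean_gross)    # about 3.1
spread_median <- relative_spread(median_gross)  # about 9.6
score_range   <- range(mean_score)              # 6.37 to 6.40
```

The median gross moves around roughly three times as much as the mean gross, although both stay within about 10% across the five samples.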

For the graphs, I decided to compare the scatter plots of gross revenue against budget for each sample. Would one of the plots show a negative trend line, or perhaps some other anomaly?

ggplot(sample_1, mapping=aes(x = budget, y = gross)) +
  geom_point(color = "black") +
  geom_smooth(method = "gam", color = "blue", se=FALSE) +
  geom_abline(slope=1, intercept=0, color="green", linetype="dashed") +
  scale_x_continuous(labels = label_dollar()) +
  scale_y_continuous(labels = label_dollar()) +
  labs(
    title = "Scatterplot of Gross Revenue vs Budget for Sample 1",
    x = "Budget",
    y = "Gross Revenue",
    subtitle = "Above the dashed line: Profitable | Below the dashed line: Loss"
  ) +
  theme_minimal()

ggplot(sample_2, mapping=aes(x = budget, y = gross)) +
  geom_point(color = "black") +
  geom_smooth(method = "gam", color = "blue", se=FALSE) +
  geom_abline(slope=1, intercept=0, color="green", linetype="dashed") +
  scale_x_continuous(labels = label_dollar()) +
  scale_y_continuous(labels = label_dollar()) +
  labs(
    title = "Scatterplot of Gross Revenue vs Budget for Sample 2",
    x = "Budget",
    y = "Gross Revenue",
    subtitle = "Above the dashed line: Profitable | Below the dashed line: Loss"
  ) +
  theme_minimal()

ggplot(sample_3, mapping=aes(x = budget, y = gross)) +
  geom_point(color = "black") +
  geom_smooth(method = "gam", color = "blue", se=FALSE) +
  geom_abline(slope=1, intercept=0, color="green", linetype="dashed") +
  scale_x_continuous(labels = label_dollar()) +
  scale_y_continuous(labels = label_dollar()) +
  labs(
    title = "Scatterplot of Gross Revenue vs Budget for Sample 3",
    x = "Budget",
    y = "Gross Revenue",
    subtitle = "Above the dashed line: Profitable | Below the dashed line: Loss"
  ) +
  theme_minimal()

ggplot(sample_4, mapping=aes(x = budget, y = gross)) +
  geom_point(color = "black") +
  geom_smooth(method = "gam", color = "blue", se=FALSE) +
  geom_abline(slope=1, intercept=0, color="green", linetype="dashed") +
  scale_x_continuous(labels = label_dollar()) +
  scale_y_continuous(labels = label_dollar()) +
  labs(
    title = "Scatterplot of Gross Revenue vs Budget for Sample 4",
    x = "Budget",
    y = "Gross Revenue",
    subtitle = "Above the dashed line: Profitable | Below the dashed line: Loss"
  ) +
  theme_minimal()

ggplot(sample_5, mapping=aes(x = budget, y = gross)) +
  geom_point(color = "black") +
  geom_smooth(method = "gam", color = "blue", se=FALSE) +
  geom_abline(slope=1, intercept=0, color="green", linetype="dashed") +
  scale_x_continuous(labels = label_dollar()) +
  scale_y_continuous(labels = label_dollar()) +
  labs(
    title = "Scatterplot of Gross Revenue vs Budget for Sample 5",
    x = "Budget",
    y = "Gross Revenue",
    subtitle = "Above the dashed line: Profitable | Below the dashed line: Loss"
  ) +
  theme_minimal()
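
Since the five plot calls differ only in the sample and the title, one alternative is to stack the samples and draw them as panels of a single figure with `facet_wrap()`, which also puts every sample on the same axes. This is a sketch on simulated data, not the plots from this report: `toy_movies`, `toy_samples`, and `combined` are hypothetical, and the budget and gross values are randomly generated.

```r
library(dplyr)
library(ggplot2)
library(scales)

set.seed(1)  # arbitrary seed so the simulated data are repeatable

# Simulated stand-in: gross is budget times a random multiplier
toy_movies <- tibble(budget = runif(200, 1e6, 2e8)) |>
  mutate(gross = budget * rlnorm(n(), meanlog = 0.3, sdlog = 0.8))

toy_samples <- lapply(1:5, function(i) {
  slice_sample(toy_movies, n = 100, replace = TRUE)
})
names(toy_samples) <- paste("Sample", 1:5)

# One long table with a column saying which sample each row came from
combined <- bind_rows(toy_samples, .id = "sample")

p <- ggplot(combined, aes(x = budget, y = gross)) +
  geom_point(color = "black") +
  geom_smooth(method = "gam", color = "blue", se = FALSE) +
  geom_abline(slope = 1, intercept = 0, color = "green", linetype = "dashed") +
  scale_x_continuous(labels = label_dollar()) +
  scale_y_continuous(labels = label_dollar()) +
  facet_wrap(~ sample) +  # one panel per sample, shared axes
  labs(
    title = "Scatterplot of Gross Revenue vs Budget by Sample",
    x = "Budget",
    y = "Gross Revenue",
    subtitle = "Above the dashed line: Profitable | Below the dashed line: Loss"
  ) +
  theme_minimal()
```

Printing `p` draws the figure. With shared axes, a trend line that bends differently in one sample stands out more clearly than it does across five separate plots.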

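Comparing the scatter plots of budget against gross revenue for these 5 samples, we see a lot of similarities. The trend line slopes upward in all 5, showing a positive relationship between budget and revenue. One difference is the blue trend line in sample 3, which does not curve upward as steeply past budgets of $200 million. Could this mean that our big-budget or highest-grossing movies are large outliers compared to the rest of our data? The summaries are consistent with that idea: in every sample the mean gross is roughly four times the median, which is what we would expect if a small number of very high-grossing films pull the mean upward. If only a few films have budgets that large, the shape of the curve in that region would also depend heavily on which of them happen to be drawn into each sample.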

I think this investigation shows that the data set has enough observations to support a reasonably good analysis and to draw cautious conclusions about real-life movies. The lack of large variation or anomalies between the samples suggests that the data set is consistent, at least for gross revenue, budget, and score. If we want to investigate the data further, the categorical variables would be good candidates to analyze in the future.