Setting up R and Loading Data set

First we bring in all the libraries we will be using. Then we load the data set we have downloaded.

#Load in Libraries
library(tidyr)
library(readr)
library(dplyr)
library(lubridate)
library(stringr)
library(janitor)
library(ggplot2)
library(scales)

#Load in the dataset
movies_raw <- read_csv("/Users/jus10segrest/Downloads/iu indy/stat for data science/movies.csv")

The next step for our data set is to clean it and format it so that we can begin to work through it.

#create a new table separating the released column into two release date/country
movies_ <- movies_raw |>
  separate(released, into = c("release_new","country_released"), sep=" \\(") |>
  mutate(country_released = str_remove(country_released, "\\)$")) |>    #remove the end parathensis
  mutate(release_date=mdy(release_new)) |>         #then change the date to an easier format
  rename(country_filmed=country)            #rename column for ease of understanding
  
movies_
## # A tibble: 7,668 Ă— 17
##    name    rating genre  year release_new country_released score  votes director
##    <chr>   <chr>  <chr> <dbl> <chr>       <chr>            <dbl>  <dbl> <chr>   
##  1 The Sh… R      Drama  1980 June 13, 1… United States      8.4 9.27e5 Stanley…
##  2 The Bl… R      Adve…  1980 July 2, 19… United States      5.8 6.5 e4 Randal …
##  3 Star W… PG     Acti…  1980 June 20, 1… United States      8.7 1.20e6 Irvin K…
##  4 Airpla… PG     Come…  1980 July 2, 19… United States      7.7 2.21e5 Jim Abr…
##  5 Caddys… R      Come…  1980 July 25, 1… United States      7.3 1.08e5 Harold …
##  6 Friday… R      Horr…  1980 May 9, 1980 United States      6.4 1.23e5 Sean S.…
##  7 The Bl… R      Acti…  1980 June 20, 1… United States      7.9 1.88e5 John La…
##  8 Raging… R      Biog…  1980 December 1… United States      8.2 3.30e5 Martin …
##  9 Superm… PG     Acti…  1980 June 19, 1… United States      6.8 1.01e5 Richard…
## 10 The Lo… R      Biog…  1980 May 16, 19… United States      7   1   e4 Walter …
## # ℹ 7,658 more rows
## # ℹ 8 more variables: writer <chr>, star <chr>, country_filmed <chr>,
## #   budget <dbl>, gross <dbl>, company <chr>, runtime <dbl>,
## #   release_date <date>

3 Unclear Columns

The first column I found that was unclear at first until I read the documentation was “score”. I was curious what is the score referring to? Is it IMDB score, rotten tomatoes, or a random movie sites collective score of the movies in the data set? The actual meaning when looking into the data was that is from IMDB and it’s the average user rating. Users on IMDB can make an account and then rate movies on the website and the score column is the average score of all of those ratings (indicated by the votes column).

The second column I looked at was “gross”. This one could have meant many different things but after looking at the documentation it is the gross revenue of the listed movie. Another question I had was if it accounts for global revenue or just revenue. Although this wasn’t listed in the documentation after looking in the comments I found that the author of the data set said it is just the revenue from the United States.

The last column I looked at was “country_filmed”. I was curious if this indicated where the majority of the film was shot, or was it just the country it was produced in. The documentation just says “country of origin,” I believe this must be then where it was produced and mainly worked out of. Even if they might have filmed in another location mainly, if it was a US based company it would say US.

Still Unclear Column

I found that the “released” column does not have very much information about it which can lead to some potential questions. In the raw data it is listed as “Month Day, Year (country released in), which I then turned into two separate columns of release-date (YYYY-MM-DD format) and released_country which shows which country it was released in. This still has some problems though to me which aren’t fully explained. Were the movies eventually released In the USA? If they never were released in the USA but say a foreign country in the”released_country” column, does that mean that all of those rows have NA in gross revenue? Why do some USA releases also have NA in their gross column if we only need USA information for the gross column.

Two Visualizations of the Released Column

The first visualization I wanted to show was how many US released movies are missing gross revenue data to see if it was a large amount or perhaps it is just an issue of available public data on certain movies

# Filter for movies released in the USA
usa_movies <- movies_ %>%
  filter(country_released == "United States") %>%
  mutate(missing_gross = ifelse(is.na(gross), "Missing", "Present"))

# Count the missing vs. present values
usa_gross_counts <- usa_movies %>%
  group_by(missing_gross) %>%
  summarise(count = n())

# Create the bar chart
ggplot(usa_gross_counts, aes(x = missing_gross, y = count, fill = missing_gross)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = count), vjust = -0.5, size = 3) +
  labs(
    title = "Count of U.S. Movies with Missing vs. Present Gross Revenue",
    x = "Gross Revenue Data",
    y = "Number of Movies"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Looking at this graph we see that 132 out of the 6735 US movies have NA in the gross revenue column. I would say this indicates an issue of publicily available information of those movies, rather than the data set having an issue. We know that we have a total of 7668 rows of movie data, meaning that there are 933 rows of data that contain not US released movies.

For the next visualization I wanted to look at foreign countries and see if they had NA’s in their data.

#Add a new column with values Missing and Present for NA values in column gross
movies_$gross_missing <- ifelse(is.na(movies_$gross), "Missing", "Present")

#Filter out US and countries that have no NA's 
movies_filtered <- movies_ %>%
  filter(country_released != "United States") %>%
  group_by(country_released) %>%
  filter(any(is.na(gross))) %>%
  ungroup()

#Create a stacked bar chart of the data
ggplot(movies_filtered, mapping = aes(x = country_released)) +
  geom_bar(mapping = aes(fill = gross_missing)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 0.8, size = 7))

In this graph we are able to see that there are NA values with other released countries as well. This raises a big question about the data. If these movies are released in other countries but the gross revenue column only takes into account US revenue, why are these not all NA? I think the most likely answer is that these movies were orginially released in the countries listed in the country_released column on the date listed, but were eventually released in the US. This would be an important number to know or find in the future as we could investigate how big of an effect releasing in other countries first has on US gross revenue.

Implicit and Explicit Categorical Columns

The first column I looked at was rating and I found a lot of NA values, which I believe are explicit types. There is no reason to have NA values for the rating column as all movies need to be rated before they are released. This could be due to low public information on those specific movies. It does not contain any implicit or empty row type of missing data.

The other column I decided to look at was a collection of Star, Writer and Director. Director had 0 missing data, Writer had 3 NAs, and Star had 1. This makes sense because these are three of the easiest things to find about ANY movie. My suspicion with writer could be that there was never a lead writer and it somehow messed with the data set the author created. I also think this could be another case of publicly missing data. There is no missing implicit data or empty groups for these categories as well.

Continuous Column Outlier - Budget

I chose this column as I was curious if there were any movies that simply blew every other movie out of the water with its budget.

ggplot(movies_, mapping = aes(y=budget)) +
  scale_y_continuous(labels = label_dollar()) +
  labs(
    title = "Boxplot of Budget Column with Outliers",
    y = "Budget",
  ) +
  geom_boxplot(outlier.color = "red")

In this graph we can see that outliers begin at movies over $100 million dollars. This makes complete sense in terms of the data set as movie producing companies aren’t willing to give out that much money often, as well as there not being that many movie companies that could fund a movie with that budget.