Week 5 Data Dive - Documentation

This week we are going to look more deeply at the data documentation and identify potential issues and challenges.

For reference here is the link to the original data set and the documentation from kaggle.com.

Box Office Data 2000-2024: Documentation

Some comments on the documentation:

The purpose and method of data collection is clearly defined within the documentation. The column values are slightly hidden and are only visible when examing the detail section of the .csv file.
3 columns include combined data, Genres, Rating and Production Countries. Further discussion regarding these columns follows below in the task demonstration section.

For clarity the column definitions are copied here:

Data Type	Column Name	Definition
integer	Rank	The global ranking of the movie based on worldwide box office revenue.
string	Release Group	The title of the movie or film series.
decimal	$Worldwide	Total worldwide box office revenue for the movie
decimal	$Domestic	Total domestic (local) box office revenue.
decimal	Domestic %	Percentage of worldwide revenue generated domestically.
decimal	$Foreign	Total foreign (international) box office revenue.
decimal	Foreign %	Percentage of worldwide revenue generated in foreign markets.
integer	Year	The release year of the movie.
string	Genres	The genres associated with the movie (e.g., Action, Drama, Comedy).
string	Rating	Average audience rating from TMDB (e.g., 7.5/10).
decimal	Vote_Count	The number of votes the movie received on TMDB.
string	Original_Language	The primary language of the movie.
string	Production_Countries	The countries where the movie was produced.

Initialization Step 1 - Load Libraries

#load the data libraries - remove or add as needed
library(tidyverse)   #tools form data science, included ggplot2, dplyr, tidyr, readr, tibble, stringr, and forcats as core libraries.

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(scales)      #loaded to address viz issues, including currency issues

## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

options(scipen=999)  #disable scientific notation since high values are used

Initialization Step 2 - Load Initial Data Frames

#load the adjusted version of the csv from the local desktop
t_box_office <- read_delim("C:/Users/danjh/Grad School/H510 Stats for DS/Datasets/box_office_data_2000_24_adj.csv", delim = ",")

## Rows: 5000 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): Release Group, Genres, Rating, Original_Language, Production_Count...
## dbl (10): Rank, $Worldwide, $Domestic, Domestic %, $Foreign, Foreign %, Year...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Insights

I like having the task instructions in this document, however for readablility, I have decide to move them down into the task demonstration section of the document as block text.

Task Demonstrations

Task 1 - review 3 columns which are unclear until reading documentation

A list of at least 3 columns (or values) in your data which are unclear until you read the documentation.

E.g., this could be a column name, or just some value inside a cell of your data

Why do you think they chose to encode the data the way they did? What could have happened if you didn’t read the documentation?

Rating

When speaking of movies the term ‘rating’ has a clear meaning. A rating is the official code added to a movie to help viewers and theaters get a general understanding of the movies content and language. In this case it is easy to expect that ‘rating’ would be the rating code. In this case this is a dangerous assumption because a review of the documentation shows this is not correct. The only clue in this case was that the work ‘rating’ is somewhat ambiguous, and a quick glance at the data content suggested this column was not representing the typical usage of the term ‘rating’ where movies are concerned.

The documentation reveal this is the score of the movie as found on TMDB (The Movie Database, themoviedb.org)

Insight - This column shows the importance of reviewing data and it’s structure early in analysis. In this case several major issues could have occurred. One, any categorical information that might be gained off the G, PG, PG-13, R, rating system would fail to work. It is worth considering that the author may have avoided the use of the ratings code as they did not want to delve into the differences between countries. The second issue with this is the actual content of the field. the values are in a text format of #/10. In this format the score cannot be analyzed or compared with others.

Looking at TMDB, it is clear the author name this value Rating as that is how that site collects user scores.

Release Group

This column name is strange. What is a release group? By title alone it is unclear what is intended by this column. It is easy to imagine it has something to do with the seasonal release, such as 1st quarter, or general groups like chiristmas, summer, spring, award season, possibly it’s related to some factor involving the production or distribution companies.

The documentation reveals this is the title of the movie or series.

Insight - The author may have been imagining one of the other interpretations when they were collecting the data. More likely it seems they were trying to provide some manner of adjusting movie titles that were part of a series or franchise. Inspection of the actual data in the column makes it is easy to discern that this is the movie title, so it is unlikely that many problems would occur due to misuse. The real danger lies in expecting that this data was actually intended for some purpose that is still unclear to the user, merely based on how the column was titled.

Genres

Why is the column named genres, plural? This column requres a look at the documentation for a couple reasons. It seems likely that genre is going to be what general category does the movie belong to, but why, again, is the column name plural? What are considered genres? Are they named consistently? Is this data set listing all possible genres a movie can belong to in one column?

The answer turns out to be yes.

Insight - Without understanding that the author was grouping the genres the analysis could break down or get extremely complicated requiring search strings, and breakdowns. Knowing how the author reported the data made it clear that a data transformation would be extremely useful before delving too deeply into the data. In this case it also required a little analysis to determine if the first item in each list could be assumed to be the primary genre. review of the data shows that the genre doesn’t appear in a consistent order suggesting that the first listed is the most important. The challenge here is that this is an assumption that the data documentation does not discuss. It is, therefore, important that all analysis refering to genre be flagged with a note letting the reviewer understand this interpretation of the data. In other words this is a bias that should be reported.

Task 2 - Review at least 1 column that is still unclear after reading the documentation

At least one element or your data that is unclear even after reading the documentation

You may need to do some digging, but is there anything about the data that your documentation does not explain?

Production Countries

The Production Countries column is very similar to the Genres column in that it has the same initial problem. Why is it being referred to as plural? the same problems are here. If multiple countries were involved in the production of a film, how is it determined which if any of the countries.

Sure the documentation says “The countries where the movie was produced”, but what does this mean? There were multiple production companies involved with ownership by different countries? The movie was produced with specific countries targeted for release? The documentation doesn’t explicitly explain this discrepancy. Further, in this case, all production countries seem to be listed alphabetically. If we wish to assume that the primary country is the first one listed, the data doesn’t seem to support this. Even after reading the documentation the true purpose of the Production coutries column is still ambiguous.

Insight - This is another case where an assumption must be made that can introduce bias. The presentation of the data science must consider the ethical implications of this data ambiguity. Of course, the bias should be represented and reported, but how much strength should be given to any analysis that suggests insight regarding Production Countries? For a lightweight analysis project such as this, that just goes to show valid analysis techniques, this is not a significant issue as long as the bias is communicated to the end user. If however this was a paid and/or official analysis with the intention of making decisions regarding policies, funds or reporting on actual trends, the data scientist should consider excluding these columns or doing further work to get the data accurate and representative, before referencing them.

What is missing? What does the data not explain?

This data set is interesting in the case of what it excludes. Why aren’t the rating codes included? Why isn’t there budget information that might help analyse the factor by which a movie is considered successful? When during the year was the movie released? What was the length of the movies run? Why did the author select the top 200 movies for each year since 2000? What was the criteria used to determine that the movie qualified as a top 200? Without these items included or explained, the data set has a relatively limited possible scope of analysis. That’s not necessarily bad, but it is critical to understand before embarking on an analysis that might no be supported by this data set.

Task 3 - Build at least 2 visualizations

Build at least two visualizations which use a column of data that is affected by the issue you brought up in bullet #2, above. In these visualizations, find a way to highlight the issue from different perspectives, explain what is unclear, and why it might be concerning.

You can use color or a text label, but also make sure to explain your thoughts using Markdown.
Do you notice any significant risks? If so, what could you do to reduce negative consequences?

For these visualizations we’ll look at the Production Countries against $worldwide earnings

Viz #1

First we’ll set up the data to look at worldwide earnings

# Summarize total worldwide earnings and count of movies by production country
country_summary <- t_box_office |>
  group_by(Prime_Production_Country) |>
  summarise(
    total_worldwide = sum(`$Worldwide`, na.rm = TRUE),
    movie_count = n()
  ) |>
  arrange(desc(total_worldwide))

We’ll filter the data to the top 10 countires by earnings and create the plot…

#create the plot
# Filter top 10 production countries by total worldwide earnings
top_10_countries <- country_summary |>
  slice_max(total_worldwide, 
            n = 10)

# Plot total worldwide earnings for top 10 production countries
ggplot(top_10_countries, 
       aes(x = reorder(Prime_Production_Country, 
                       -total_worldwide), 
           y = total_worldwide)) +
  geom_bar(stat = "identity", 
           fill = "skyblue") +
  coord_flip() +
  labs(title = "Top 10 Production Countries by Total Worldwide Earnings",
       x = "Production Country",
       y = "Total Worldwide Earnings") +
  theme_classic()

Insights

The challenge with this graph is we are forced to make an assumption that the first country listed in the Prime_Production_Country column is indeed the primary production country.

This problem might be addressed by adding a flag for Multi country productions vs single country productions.

Viz #2

Next we’ll look at the same data but we’ll filter it to the top 10 values alphabetically….

#get the data
country_summary <- t_box_office |>
  group_by(Prime_Production_Country) |>
  summarise(
    total_worldwide = sum(`$Worldwide`, 
                          na.rm = TRUE),
    movie_count = n()
  ) |>
  arrange(Prime_Production_Country)

# Filter the first 10 production countries alphabetically
first_10_countries <- country_summary |>
  slice_head(n = 10)

then build a plot to see the results…

#create the plot
# Plot total worldwide earnings for the first 10 production countries alphabetically
ggplot(first_10_countries, 
       aes(x = reorder(Prime_Production_Country, 
                       -total_worldwide), 
           y = total_worldwide)) +
  geom_bar(stat = "identity", 
           fill = "skyblue") +
  coord_flip() +
  labs(title = "First 10 Production Countries \nAlphabetically by Total Worldwide Earnings",
       x = "Production Country",
       y = "Total Worldwide Earnings") +
  theme_minimal()

Insights

Here is the same data, but using the top 10 countries alphabetically. It is possible that these numbers are inflated just because they would appear first. To address this I would consider adding a column to the data set that shows the number of countries listed in production countries and do some analysis on movies with one production country vs mulitiple countries.

Additionally I might consider creating a new data set that adds rows for each different production country, but add a number that specifies a, how many countries were associated with the film and b, what position was that country in the countries list. This would create instances where a single film would have multiple rows so care should be taken to avoid inflating numbers artificially by counting a movie multiple times. The advantage would be that deeper analysis could be performed on productiion countries. How well did films do when produced with other countries? Are any combinations extremely successful? What are the most common groupings? Does for example sharing production credit with the US increase the potential success of a film?

Task 4 - For 2 categorical columns look for explicit and implicitly missing rows and empty groups

For at least two categorical columns, check for examples of the following, and describe what you find:

Are there explicitly missing rows?

Are there implicitly missing rows?

Are there empty groups?

For this one, we’ll put all the code in one block.
We’re going to look at each year of movies and see which ones don’t have scores…

#get the data
no_rating <- t_box_office |>
  group_by(Year) |>
  summarise(
    missing = sum(is.na(Rating_of_10)),
    not_missing = sum(!is.na(Rating_of_10))
  ) |>
  pivot_longer(cols = c(missing, 
                        not_missing), 
               names_to = "status", 
               values_to =  "count")

#create the plot
ggplot(no_rating,
       aes(x = Year,
           y = count,
           fill = status))+
  geom_bar(stat="identity",
           position = "stack") +
  labs(title = "Top 200 movies each year without a score",
       x= "Genre",
       y= "Number of Scores",
       fill = "Score Status") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, hjust =1)) +
  scale_x_continuous(breaks = seq(min(no_rating$Year), max(no_rating$Year), by = 1))

Insight

This does show that most years there are missing scores for some of the movies. This might become an issue if we’re trying to compare average ratings for each year. While there are 200 movies listed for each year the scale for each year is slightly different and it could introduce a slight bias. For example 2010 and 2020 have fewer scores than the other years even though the number of overall movies is the same.

#get the data
no_genre <- t_box_office |>
  group_by(Year) |>
  summarise(
    missing = sum(is.na(Prime_Genre)),
    not_missing = sum(!is.na(Prime_Genre))
  ) |>
  pivot_longer(cols = c(missing, 
                        not_missing), 
               names_to = "status", 
               values_to =  "count")

# Filter data to include only missing values and countries with at least one missing value
filtered_missing_data <- no_genre %>%
  filter(status == "missing" & count > 0)

# Create the bar chart with rotated y-axis labels
ggplot(filtered_missing_data, 
       aes(x = Year, 
           y = count, 
           fill = status)) +
  geom_bar(stat = "identity", 
           position = "stack") +
  labs(title = "# of Movies without a genre by Year",
       x = "Country",
       y = "Number of Missing Genres",
       fill = "Genre Status") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, 
                                   hjust = 1)) +
  scale_x_continuous(breaks = seq(min(no_rating$Year), 
                                  max(no_rating$Year), 
                                  by = 1))

Insight

The above plots are both illustrations of explicitly missing data. Both illustrations show the number of rows of data that are missing.

If, for example the “Top 200 movies each year without a score” graph did not include a any values for 2007, this would be an example of an implictly missing set of data. In this case the data does not appear to be implicitly missing data. Additionally, if we expected to see budget data in the data set but it was only included for a couple of the years, this also would qualify as explicitly missing data, but as theat data is not in the data set at all it is just absent. I would suggest that implicitly missing data is typically a problem when joining related data sets. If the data capture rules were different between sets it is possible to create a situation where expected data is incomplete.

#create the data
# Generate all possible combinations of years and genres
years <- unique(t_box_office$Year)
genres <- unique(t_box_office$Prime_Genre)
all_combinations <- expand.grid(Year = years, 
                                Genre = genres)

# Summarize your existing data
movies_summary <- t_box_office |>
  group_by(Year, Prime_Genre) |>
  summarise(movie_count = n(), 
            .groups = 'drop') |>
  rename(Genre = Prime_Genre)

# Left join all combinations with the existing data
#same principal as sql it looks like
full_data <- all_combinations |>
  left_join(movies_summary, 
            by = c("Year", 
                   "Genre"))

# Identify the combinations with no movies
no_movies <- full_data |>
  filter(is.na(movie_count))

# Create the bar chart
ggplot(no_movies, 
       aes(x = Year, 
           y = Genre)) +
  geom_tile(aes(fill = factor(movie_count)), 
            color = "white") +
  scale_fill_manual(values = c("NA" = "red"), 
                    labels = c("NA" = "Missing")) +
  labs(title = "Missing Movies by Year and Genre",
       x = "Year",
       y = "Genre",
       fill = "Status") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, 
                                   hjust = 1)) +
    scale_x_continuous(breaks = seq(min(no_rating$Year), 
                                    max(no_rating$Year), 
                                    by = 1))

Insight

This tile chart shows the six genres that did not have movies represented in the top 200 in various years. I would classify these as empty groups as the combinations of year and genre doesn’t return any results in several cases. This is actually a very interesting chart in that it clearly shows that Westerns aren’t well represented in the top 200, but every year but 2012 included a Mystery movie in the top 200. Surprising also, is how well represented Documentaries are except for in 2019 and 2021.

Task 5 - For at least one continuous column, define and outlier and explain

For at least one continuous column, what would you define as an outlier, and why?

I am looking at the $Worldwide column which shows the worldwide income of all movies in the top 200 between 2000 and 2024.

here’s the data…

#rename the df
df_3 <- t_box_office
# Calculate box plot statistics for the $Worldwide column
box_stats <- boxplot.stats(df_3$`$Worldwide`)

# Get the number of outliers
num_outliers <- length(box_stats$out)

# Get the highest and lowest outliers
highest_outlier <- max(box_stats$out, na.rm = TRUE)
lowest_outlier <- min(box_stats$out, na.rm = TRUE)

# Calculate the number of data points within the box
num_in_box <- nrow(df_3) - num_outliers

avg_income <- mean(df_3$`$Worldwide`, na.rm = TRUE)

and now we create a box plot to look at the data…

ggplot(df_3, 
       aes(y = `$Worldwide`)) +  # Use backticks for column names with special characters
  geom_boxplot(fill = "lightblue", 
               color = "darkblue", 
               outlier.color = "red", 
               outlier.size = 2, 
               width = 0.7) +  # Customize box plot appearance
  theme_classic() +
  labs(title = "Box Plot of Worldwide Income",
       y = "Worldwide Income") +
  geom_hline(yintercept = avg_income, 
             linetype = "dashed",
             color = "red") +
  annotate("text",
           x = 1,  # Move text to a centered position
           y = avg_income,
           label = paste("Average Income:", avg_income), 
           vjust = -0.5, 
           hjust = 1, 
           color = "red") +
  annotate("text",
           x = 1,  # Move text to a centered position
           y = max(box_stats$stats), 
           label = paste("Outliers:", num_outliers), 
           vjust = -12.5, 
           hjust = 1, 
           color = "red") +
  annotate("text",
           x = 1,  # Move text to a centered position
           y = min(box_stats$stats), 
           label = paste("In Box:", num_in_box), 
           vjust = 1, 
           hjust = 1, 
           color = "blue") +
  annotate("text",
           x = 1,  # Move text to a centered position
           y = highest_outlier, 
           label = paste("Highest Outlier:", highest_outlier), 
           vjust = 0, 
           hjust = 2, 
           color = "purple") +
  annotate("text",
           x = 1,  # Move text to a centered position
           y = lowest_outlier, 
           label = paste("Lowest Outlier:", lowest_outlier), 
           vjust = 0, 
           hjust = 2, 
           color = "purple") +
  theme(
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),  # Enhance the plot title
    axis.title.y = element_text(size = 12, face = "bold")              # Enhance the y-axis title
  )

Insights

The interesting thing about this box plot is the very tight distribution around the average. 4447 Movies fall inside the box and 553 are greater than 1.5 times this range. That’s 0.12% of the movies make more that $262 Million with the largest money maker making $2.79B This is an incredibly wide range. If we are trying to identify typical average income by well rated movies, these outliers, which represent unusually high income, could significantly skew the results.

This box plot does however bring up some interesting questions that might be worth examining.

What is the makeup of those 553 outliers, what genre, what country? How does the performance of those outliers compare to the performance of the non-outliers of the same genre or country? How does the box plot change if the outliers are excluded?

General coding insights

Something I didn’t expect with R studio, that I’ve grown to like. It is often difficult to indent code cleanly to show the various attributes together. With R studio this is handled very cleanly. Hitting the enter key after a comma indents the line in accordance with the code group it belongs to. I find this very helpful as it brings better clarity to what attributes are part of the same function.

H510_DJH_wk5_DataDive_v01

DJH

2025-02-17

Week 5 Data Dive - Documentation

Initialization Step 1 - Load Libraries

Initialization Step 2 - Load Initial Data Frames

Insights

Task Demonstrations

Task 1 - review 3 columns which are unclear until reading documentation

Rating

Release Group

Genres

Task 2 - Review at least 1 column that is still unclear after reading the documentation

Production Countries

What is missing? What does the data not explain?

Task 3 - Build at least 2 visualizations

Viz #1

Insights

Viz #2

Insights

Task 4 - For 2 categorical columns look for explicit and implicitly missing rows and empty groups

Insight

Insight

Insight

Task 5 - For at least one continuous column, define and outlier and explain

Insights

General coding insights