Week 3 Data Dive

Data Dive Tasks:

At least 3 “group by” data frames, and an investigation into each. You’ll need to use categorical columns, or one of the cut_ functions here
- Use the group_by function to group your data into (at least) 3 different sets of groups, each summarizing different variables.
- For example, this could be as simple as three data frames which group your data based on three different categorical columns, but summarize the same continuous column. Or, it could be as complex as three different combinations of categorical columns, each illustrating summarizations of different continuous (or categorical columns).
- Each group in a group_by dataframe will have a number of rows associated with it (e.g., if you only group by a single column, then this is the result of count). So, if we were to randomly select a row from your dataset, the smallest groups have a lowest probability of being selected.
- Assign the lowest probability group(s) a special tag, and then translate that back into the context of your data. Draw some conclusions about the groups you’ve found. (I.e., in other words, what does it mean that Group X is the smallest? Can you phrase this in terms of probability?)
- Draw a testable hypothesis for why some groups are rarer than others (i.e., something quantifiable).
- Build at least one visualization for each of these three groupings.
Pick two categorical variables, and build a data frame of all their combinations (i.e., the unique rows among the two columns).
- (Only if it exists…) Find a combination that does not exist in the data. (For example, if you have a column for “color” and a column for “size”, you might be missing a row that is both “Blue” and “Medium”.) Why do you think these are missing?
- Which combinations are the most/least common, and why might that be? Hint: for this, you’ll need to use count or group_by.
- Find a way to visualize at least one of the combinations.

*explain what insight was gained.

Initialization Step 1 - Load Libraries

#load the data libraries
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(dplyr)
library(scales)      #loaded to address currency issues

## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

options(scipen=999)  #disable scientific notation since high values are used

Initialization Step 2 - Load data set

*Using an adjusted version of the enhanced_box_office_data(2000-2024)u.csv which has the following column changes to support cleaner coding and analyses. Changes listed here were made directly in the csv source file.

prime_genre was added to show the primary genre selected for each film. The initial value in the comma separated data was defaulted as the primary genre. Data review shows genres were not always the same place in the order suggesting a hierarchy.
prime_production_company was added to show the primary production country for each film. It is unclear through the data if there is any precedence as countries are most frequently listed alphabetically. For the purpose of this review the intial value in the comma separated data was selected as the primary.
rating_of_10 was added to represent the rating in a format that allows further calculation. The original format of x.xx/10 was converted to x.xx after all values were confirmed to be on the same scale of 10
rating_scale was added to confirm that all ratings are on the same scale. Review shows all values are either 10 or null confirming that all non null instances of rating_of_10 can be measured against each other as they match the same scale.

*All further data dives will use the same adjusted data set and will not explicitly mention these adjustments. These adjustments will be called out in the final project for publication.

#load the adjusted version of the csv with the changes from above
t_box_office <- read_delim("C:/Users/danjh/Grad School/H510 Stats for DS/Datasets/box_office_data_2000_24_adj.csv", delim = ",")

## Rows: 5000 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): Release Group, Genres, Rating, Original_Language, Production_Count...
## dbl (10): Rank, $Worldwide, $Domestic, Domestic %, $Foreign, Foreign %, Year...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Insights

In working on this weeks assigments some attention was brought to the idea of when in the code should data management occur. I considered adding columns to the primary data frame in the code block above so those changes would be available in the tasks for the week. I chose not to do this in favor of adding changes to the data frame in conjunction with the task being performed to make the activities follow one behind the other logically instead of requiring th reviewer to scroll back to the top for clarity.

Task demonstrations

For these tasks I am following pattern:

Set up the requested data frame, including any code required to update the primary data frame with support information.
Create a visualization that explores the data.
Provide Insights or hypotheses

Task 1 - Build 3 Group by data frames

At least 3 “group by” data frames, and an investigation into each. You’ll need to use categorical columns, or one of the cut_ functions here

For these tests it is important to recognize the scope of this data set. The data set represents Ratings and earnings for the top 200 rated movies each year between 2000 and 2024.

Group By #1 - What is the average revenue by each genre?

For this task I am interested in what the average revenue is for each genre. As this is a large data set I am limiting the results to the top 8 average genres.

Create the data set

#Create a summary data set of ww revenue by Genre
Worldwide_rev_by_genre <- t_box_office |>                  #create a new df                         
  group_by(Prime_Genre) |>                                 #group the data by genre
  summarise(                                               #add summary data to the df
    avg_ww_revenue = mean(`$Worldwide`, na.rm = TRUE),     #add a column for the mean of the $Worldwide by genre
    count = n()                                            #add a column for the number of rows in each genre
    ) |>
  arrange((count))                                         #sorts the data in ascending order by default
    
#Worldwide_rev_by_genre

Create a visualization

#Select the top 8 genres by average worldwide revenue
#Filter the data set to the top * values for better visualization
top_8_genres <- Worldwide_rev_by_genre |>                #create a new df
  arrange(desc(avg_ww_revenue)) |>                       #arrange the data in descending order
  slice(1:8)                                             #pulls the first 8 rows, works because the data is ordered

# Create the bar chart
base_plot <- ggplot(top_8_genres, 
                    aes(x = reorder(Prime_Genre,         #x is the reordered version of genre
                                    -avg_ww_revenue),    #order in descending order by the avg revenue
                        y = avg_ww_revenue,              #y is the avg revenue
                        fill = Prime_Genre)) +           #bars are categorized by genre color
  geom_bar(stat = "identity") +                          #height is determined by data vals
  scale_y_continuous(labels = dollar_format()) +         #formatting as currency
  labs(title = "Average Worldwide Value by Genre (Top 8 Genres)",
       subtitle = "Bar chart showing the average worldwide value for the top 8 genres in descending order",
       x = "Genre",
       y = "Average Worldwide Value") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_brewer(palette = "Dark2", name = "Genre")    #use color brewer Dark2

base_plot

Which group has the lowest probability? The graph shows “Western”.

Adding a column to this data set that tests if a row is Western.

#update the data frame to add a column to represent the the rows which are 'Westerns'
t_box_office <- t_box_office |>
  mutate(IsWestern = Prime_Genre == "Western")            #add a boolean column called IsWestern
#view(t_box_office)

Adding an annotation to the graph that calculates the probability that a selected movie is a Western.

# Check if 'Iswestern' column exists and has valid data
if ("IsWestern" %in% colnames(t_box_office) && !all(is.na(t_box_office$IsWestern))) {    #makes sure all rows are not NA and IsWestern is a valid name
  prob_western <- round(sum(t_box_office$IsWestern,                                      
                            na.rm = TRUE) / nrow(t_box_office) * 100, 2)
} else {
  prob_western <- NA
}

# Create the annotated plot
annotated_plot <- base_plot +
  annotate("text", 
           x = 3, y = max(top_8_genres$avg_ww_revenue) * 0.95,                           #where will the annotation be placed?  95% of the Max value
           label = paste("The probability that a selected \nmovie is a Western is", prob_western, "%"),   #uses the calculated probability for Western
           size = 5, 
           color = "blue", 
           hjust = 0)

# Display the plot with the annotation
annotated_plot

Insights

Fought this code. for some reason, simpler versions of the code were having trouble recognizing the IsWestern column. Had to add the if else to keep the code from breaking. Best guess, is there was a capitalization issue that was hard to discern due to the font.

What is a testable hypothesis?

The average worldwide value of the genre Western is dis-proportionate to the number or Westerns that made it into the top 200.

Group By # 2 - What is the probability that a country has less than 10 movies in the top 200 since 2000?

Create the data set

#add a column to the t_box_office data frame to flag countries with less than 10 movies in the df.
t_box_office <- t_box_office %>%
  group_by(Prime_Production_Country) %>%
  mutate(low_movie_count = n() <= 10) %>%
  ungroup()
#view(t_box_office)

# create a data frame that includes a column for less than 10 movies
low_movie_count_stats <- t_box_office |> 
  group_by(low_movie_count) |> 
  summarize(count = n(), 
            probability = round(count / nrow(t_box_office) * 100, 2), .groups = 'drop'     # Ensure ungrouped data 
            )

Create the visualization

low_movie_count_stats |> 
  ggplot(aes(
    x = low_movie_count, 
    y = count, 
    fill = as.factor(low_movie_count)
    )) + 
  geom_bar(stat = "identity") + 
  geom_text(aes(
    label = paste0(probability, "%")), 
    vjust = -0.5) + 
  labs(title = "Distribution of Movies by Low Movie Count", 
       subtitle = "Probability of each value in the data set", 
       x = "Low Movie Count", 
       y = "Count") + 
  theme_classic() + 
  scale_fill_manual(values = c("TRUE" = "red", "FALSE" = "blue")) + 
  theme(legend.position = "none") +
    annotate("text", 
             x = 1.5, 
             y = max(low_movie_count_stats$count) + 5, 
           label = "This distribution indicates that most \ncountries represented have produced at \nleast 10 highly rated movies \nsince 2000.", 
           hjust = 0, 
           vjust = 1, 
           size = 4, 
           color = "blue")

Insights

What is a testable hypothesis?

Countries that have more movies on this list will have a higher average income than countries that have less than 10 movies on the list

Group By #3 - What is the distribution of genres across countries?

Create/Update the data set

# Calculate the count for each production country
country_counts <- t_box_office |>
  filter(!is.na(Prime_Production_Country)) |>
  count(Prime_Production_Country) |>
  arrange(desc(n)) |>
  slice(1:8) # Select top 8 countries

# Filter the original data to include only the top 8 countries
top_countries_data <- t_box_office |>
  filter(Prime_Production_Country %in% country_counts$Prime_Production_Country, 
         !is.na(Prime_Genre))

# Summarize the data to count the number of movies for each combination of Prime_Genre and Prime_Production_Country
genre_country_counts <- top_countries_data |>
  count(Prime_Genre, 
        Prime_Production_Country)

# Arrange genres in descending order based on total movie count
genre_order <- genre_country_counts |>
  group_by(Prime_Genre) |>
  summarize(total_movies = sum(n)) |>
  arrange(desc(total_movies)) |>
  pull(Prime_Genre)

Create the visualization

# Create the stacked bar chart
ggplot(genre_country_counts, aes(x = factor(Prime_Genre, levels = genre_order), y = n, fill = Prime_Production_Country)) +
  geom_bar(stat = "identity") +
  labs(title = "Number of Movies by Genre and Production Country (Top 8 Countries)",
       subtitle = "Stacked bar chart showing the distribution of movies by genre and production country",
       x = "Genre",
       y = "Count") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_brewer(palette = "Set1", name = "Production Country")

Insights

Had to limit this visualization to the top 8 countries for visual reasons. Looking at all countries was extremely messy.
While I was trying to stay consistent with the dark2 color palette, the Set1 color palette as more distinctive in this case.

Looking at this data what is a testable hypotheses?

The frequency of a Genre is related to it’s worldwide income. Genres which on average have a higher income happen more frequently.

Task 2 - Build data frames of two categorical variables with all their combinations

Pick two categorical variables, and build a data frame of all their combinations (i.e., the unique rows among the two columns).

Create the data set

# Calculate the number of movies for each country
movie_count <- table(t_box_office$Prime_Production_Country)             # a frequency table
movie_count <- as.data.frame(movie_count)                               #convert it to a df
colnames(movie_count) <- c("Country",                                   #rename the columns
                           "Number_of_Movies")

# Sort by the number of movies in descending order and select the top 10 countries
#this code uses the head() function to grab the top 10 rows and pull the Country row. 
top_countries_movie_count <- head(movie_count[order(-movie_count$Number_of_Movies),  
                                              ], 
                                  10)$Country
# create the 'dstnct_genre object
dstnct_genre <- t_box_office |>                                          # Use the t_box_office data frame
  distinct(Prime_Genre) |>                                               # Get distinct values of the Prime_Genre column
  pull(Prime_Genre) 

# Create a data frame with only the top 10 countries by the largest number of movies
df_top_movie_count <- expand.grid(Country = top_countries_movie_count, 
                                  Genre = dstnct_genre)
df_top_movie_count <- na.omit(df_top_movie_count)

# Check for existence of each combination in t_box_office
df_top_movie_count$Exists <- apply(df_top_movie_count, 
                                   1, 
                                   function(row) {
  any(t_box_office$Prime_Production_Country == row["Country"] & 
      t_box_office$Prime_Genre == row["Genre"])
})

Create the visualization

# Visualize the data using a tile plot
ggplot(df_top_movie_count, aes(x = Genre, 
                               y = Country, 
                               fill = Exists)) +
  geom_tile(color = "white") +
  scale_fill_manual(values = c("TRUE" = "skyblue", 
                               "FALSE" = "grey90")) +
  labs(title = "Top 10 Countries by Largest Number of Movies",
       x = "Genre",
       y = "Country",
       fill = "Exists") +
  theme_classic() +
  theme(axis.text.y = element_text(size = 10, 
                                   margin = margin(r = 1)),
        axis.text.x = element_text(angle = 45, 
                                   hjust = 1),
        axis.title.y = element_text(margin = margin(t = 10, 
                                                    b = 10)),
        plot.margin = margin(t = 20, 
                             r = 20, 
                             b = 20, 
                             l = 20)) +
  coord_fixed(ratio = 5 / 5)

Insights

This final plot was difficult. Creating the data frame wasn’t too hard, but it wasn’t until I found the tile plot that I was able to visualize the data effectively. The original goal was to list all countries in the Y axis, but it never looked clean and attempts to manage row heights didn’t work as expected. To make the plot more meaningful, filtering the data to the top 10 countries with the largest number of movies seemed the cleanest. It uses the combination data set to tell an easy to interpret story. US, UK and Germany all have produced top 200 movies of every genre since 2000. India is in the top 5 of movies in the top 200’s in all but 5 Genres. Only 4 countries, in the top 10, have produced Westerns that made it to the top 200.
This is a useful look at this data. In further analysis including a companion version that looks at the top 10 avg money makers would also be interesting.

Insights - Overall

Learned some things about my coding preferences. I like multi-part statements, ie functions separated by commas to each be on their own line. This helps me think about each part of the function so I can be aware of which part might be missed. This goes along also with my commenting preference. By separating each statement or function with commas it is easier to add right side comments to address anything individual lines that were troublesome or interesting. When the lines get long it’s harder for me to read and see what belongs where. I guess, I don’t like scrolling to the right.
Also, I learned my commenting lessened the deeper I got. As I understand the basics more I am not feeling the necessity to comment each line, just the main actions, or items that are confusing.

H510 Data Dive wk3 DJH

Week 3 Data Dive

Data Dive Tasks:

Initialization Step 1 - Load Libraries

Initialization Step 2 - Load data set

Insights

Task demonstrations

Task 1 - Build 3 Group by data frames

Group By #1 - What is the average revenue by each genre?

Create the data set

Create a visualization

Insights

Group By # 2 - What is the probability that a country has less than 10 movies in the top 200 since 2000?

Create the data set

Create the visualization

Insights

Group By #3 - What is the distribution of genres across countries?

Create/Update the data set

Create the visualization

Insights

Task 2 - Build data frames of two categorical variables with all their combinations

Create the data set

Create the visualization

Insights

Insights - Overall