At least 3 “group by” data frames, and an investigation into
each. You’ll need to use categorical columns, or one of
the cut_
functions here
Use the group_by
function to group your data into
(at least) 3 different sets of groups, each summarizing different
variables.
For example, this could be as simple as three data frames which group your data based on three different categorical columns, but summarize the same continuous column. Or, it could be as complex as three different combinations of categorical columns, each illustrating summarizations of different continuous (or categorical columns).
Each group in a group_by
dataframe will have a
number of rows associated with it (e.g., if you only group by a single
column, then this is the result of count
). So, if we were
to randomly select a row from your dataset, the smallest groups have a
lowest probability of being selected.
Assign the lowest probability group(s) a special tag, and then translate that back into the context of your data. Draw some conclusions about the groups you’ve found. (I.e., in other words, what does it mean that Group X is the smallest? Can you phrase this in terms of probability?)
Draw a testable hypothesis for why some groups are rarer than others (i.e., something quantifiable).
Build at least one visualization for each of these three groupings.
Pick two categorical variables, and build a data frame of all their combinations (i.e., the unique rows among the two columns).
count
or group_by
.*explain what insight was gained.
#load the data libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)
library(scales) #loaded to address currency issues
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
options(scipen=999) #disable scientific notation since high values are used
*Using an adjusted version of the enhanced_box_office_data(2000-2024)u.csv which has the following column changes to support cleaner coding and analyses. Changes listed here were made directly in the csv source file.
prime_genre
was added to show the primary genre
selected for each film. The initial value in the comma separated data
was defaulted as the primary genre. Data review shows genres were not
always the same place in the order suggesting a hierarchy.
prime_production_company
was added to show the
primary production country for each film. It is unclear through the data
if there is any precedence as countries are most frequently listed
alphabetically. For the purpose of this review the intial value in the
comma separated data was selected as the primary.
rating_of_10
was added to represent the rating in a
format that allows further calculation. The original format of x.xx/10
was converted to x.xx after all values were confirmed to be on the same
scale of 10
rating_scale
was added to confirm that all ratings
are on the same scale. Review shows all values are either 10 or null
confirming that all non null instances of rating_of_10
can
be measured against each other as they match the same scale.
*All further data dives will use the same adjusted data set and will not explicitly mention these adjustments. These adjustments will be called out in the final project for publication.
#load the adjusted version of the csv with the changes from above
t_box_office <- read_delim("C:/Users/danjh/Grad School/H510 Stats for DS/Datasets/box_office_data_2000_24_adj.csv", delim = ",")
## Rows: 5000 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Release Group, Genres, Rating, Original_Language, Production_Count...
## dbl (10): Rank, $Worldwide, $Domestic, Domestic %, $Foreign, Foreign %, Year...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
In working on this weeks assigments some attention was brought to the idea of when in the code should data management occur. I considered adding columns to the primary data frame in the code block above so those changes would be available in the tasks for the week. I chose not to do this in favor of adding changes to the data frame in conjunction with the task being performed to make the activities follow one behind the other logically instead of requiring th reviewer to scroll back to the top for clarity.
For these tasks I am following pattern:
Set up the requested data frame, including any code required to update the primary data frame with support information.
Create a visualization that explores the data.
Provide Insights or hypotheses
At least 3 “group by” data frames, and an investigation into each.
You’ll need to use categorical columns, or one of
the cut_
functions here
For these tests it is important to recognize the scope of this data set. The data set represents Ratings and earnings for the top 200 rated movies each year between 2000 and 2024.
For this task I am interested in what the average revenue is for each genre. As this is a large data set I am limiting the results to the top 8 average genres.
#Create a summary data set of ww revenue by Genre
Worldwide_rev_by_genre <- t_box_office |> #create a new df
group_by(Prime_Genre) |> #group the data by genre
summarise( #add summary data to the df
avg_ww_revenue = mean(`$Worldwide`, na.rm = TRUE), #add a column for the mean of the $Worldwide by genre
count = n() #add a column for the number of rows in each genre
) |>
arrange((count)) #sorts the data in ascending order by default
#Worldwide_rev_by_genre
#Select the top 8 genres by average worldwide revenue
#Filter the data set to the top * values for better visualization
top_8_genres <- Worldwide_rev_by_genre |> #create a new df
arrange(desc(avg_ww_revenue)) |> #arrange the data in descending order
slice(1:8) #pulls the first 8 rows, works because the data is ordered
# Create the bar chart
base_plot <- ggplot(top_8_genres,
aes(x = reorder(Prime_Genre, #x is the reordered version of genre
-avg_ww_revenue), #order in descending order by the avg revenue
y = avg_ww_revenue, #y is the avg revenue
fill = Prime_Genre)) + #bars are categorized by genre color
geom_bar(stat = "identity") + #height is determined by data vals
scale_y_continuous(labels = dollar_format()) + #formatting as currency
labs(title = "Average Worldwide Value by Genre (Top 8 Genres)",
subtitle = "Bar chart showing the average worldwide value for the top 8 genres in descending order",
x = "Genre",
y = "Average Worldwide Value") +
theme_classic() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_brewer(palette = "Dark2", name = "Genre") #use color brewer Dark2
base_plot
Which group has the lowest probability? The graph shows “Western”.
Adding a column to this data set that tests if a row is Western.
#update the data frame to add a column to represent the the rows which are 'Westerns'
t_box_office <- t_box_office |>
mutate(IsWestern = Prime_Genre == "Western") #add a boolean column called IsWestern
#view(t_box_office)
Adding an annotation to the graph that calculates the probability that a selected movie is a Western.
# Check if 'Iswestern' column exists and has valid data
if ("IsWestern" %in% colnames(t_box_office) && !all(is.na(t_box_office$IsWestern))) { #makes sure all rows are not NA and IsWestern is a valid name
prob_western <- round(sum(t_box_office$IsWestern,
na.rm = TRUE) / nrow(t_box_office) * 100, 2)
} else {
prob_western <- NA
}
# Create the annotated plot
annotated_plot <- base_plot +
annotate("text",
x = 3, y = max(top_8_genres$avg_ww_revenue) * 0.95, #where will the annotation be placed? 95% of the Max value
label = paste("The probability that a selected \nmovie is a Western is", prob_western, "%"), #uses the calculated probability for Western
size = 5,
color = "blue",
hjust = 0)
# Display the plot with the annotation
annotated_plot
What is a testable hypothesis?
The average worldwide value of the genre Western is
dis-proportionate to the number or Westerns that made it into the top
200.
#add a column to the t_box_office data frame to flag countries with less than 10 movies in the df.
t_box_office <- t_box_office %>%
group_by(Prime_Production_Country) %>%
mutate(low_movie_count = n() <= 10) %>%
ungroup()
#view(t_box_office)
# create a data frame that includes a column for less than 10 movies
low_movie_count_stats <- t_box_office |>
group_by(low_movie_count) |>
summarize(count = n(),
probability = round(count / nrow(t_box_office) * 100, 2), .groups = 'drop' # Ensure ungrouped data
)
low_movie_count_stats |>
ggplot(aes(
x = low_movie_count,
y = count,
fill = as.factor(low_movie_count)
)) +
geom_bar(stat = "identity") +
geom_text(aes(
label = paste0(probability, "%")),
vjust = -0.5) +
labs(title = "Distribution of Movies by Low Movie Count",
subtitle = "Probability of each value in the data set",
x = "Low Movie Count",
y = "Count") +
theme_classic() +
scale_fill_manual(values = c("TRUE" = "red", "FALSE" = "blue")) +
theme(legend.position = "none") +
annotate("text",
x = 1.5,
y = max(low_movie_count_stats$count) + 5,
label = "This distribution indicates that most \ncountries represented have produced at \nleast 10 highly rated movies \nsince 2000.",
hjust = 0,
vjust = 1,
size = 4,
color = "blue")
What is a testable hypothesis?
Countries that have more movies on this list will have a higher average income than countries that have less than 10 movies on the list
# Calculate the count for each production country
country_counts <- t_box_office |>
filter(!is.na(Prime_Production_Country)) |>
count(Prime_Production_Country) |>
arrange(desc(n)) |>
slice(1:8) # Select top 8 countries
# Filter the original data to include only the top 8 countries
top_countries_data <- t_box_office |>
filter(Prime_Production_Country %in% country_counts$Prime_Production_Country,
!is.na(Prime_Genre))
# Summarize the data to count the number of movies for each combination of Prime_Genre and Prime_Production_Country
genre_country_counts <- top_countries_data |>
count(Prime_Genre,
Prime_Production_Country)
# Arrange genres in descending order based on total movie count
genre_order <- genre_country_counts |>
group_by(Prime_Genre) |>
summarize(total_movies = sum(n)) |>
arrange(desc(total_movies)) |>
pull(Prime_Genre)
# Create the stacked bar chart
ggplot(genre_country_counts, aes(x = factor(Prime_Genre, levels = genre_order), y = n, fill = Prime_Production_Country)) +
geom_bar(stat = "identity") +
labs(title = "Number of Movies by Genre and Production Country (Top 8 Countries)",
subtitle = "Stacked bar chart showing the distribution of movies by genre and production country",
x = "Genre",
y = "Count") +
theme_classic() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_brewer(palette = "Set1", name = "Production Country")
Had to limit this visualization to the top 8 countries for visual reasons. Looking at all countries was extremely messy.
While I was trying to stay consistent with the dark2 color palette, the Set1 color palette as more distinctive in this case.
Looking at this data what is a testable hypotheses?
Pick two categorical variables, and build a data frame of all their combinations (i.e., the unique rows among the two columns).
# Calculate the number of movies for each country
movie_count <- table(t_box_office$Prime_Production_Country) # a frequency table
movie_count <- as.data.frame(movie_count) #convert it to a df
colnames(movie_count) <- c("Country", #rename the columns
"Number_of_Movies")
# Sort by the number of movies in descending order and select the top 10 countries
#this code uses the head() function to grab the top 10 rows and pull the Country row.
top_countries_movie_count <- head(movie_count[order(-movie_count$Number_of_Movies),
],
10)$Country
# create the 'dstnct_genre object
dstnct_genre <- t_box_office |> # Use the t_box_office data frame
distinct(Prime_Genre) |> # Get distinct values of the Prime_Genre column
pull(Prime_Genre)
# Create a data frame with only the top 10 countries by the largest number of movies
df_top_movie_count <- expand.grid(Country = top_countries_movie_count,
Genre = dstnct_genre)
df_top_movie_count <- na.omit(df_top_movie_count)
# Check for existence of each combination in t_box_office
df_top_movie_count$Exists <- apply(df_top_movie_count,
1,
function(row) {
any(t_box_office$Prime_Production_Country == row["Country"] &
t_box_office$Prime_Genre == row["Genre"])
})
# Visualize the data using a tile plot
ggplot(df_top_movie_count, aes(x = Genre,
y = Country,
fill = Exists)) +
geom_tile(color = "white") +
scale_fill_manual(values = c("TRUE" = "skyblue",
"FALSE" = "grey90")) +
labs(title = "Top 10 Countries by Largest Number of Movies",
x = "Genre",
y = "Country",
fill = "Exists") +
theme_classic() +
theme(axis.text.y = element_text(size = 10,
margin = margin(r = 1)),
axis.text.x = element_text(angle = 45,
hjust = 1),
axis.title.y = element_text(margin = margin(t = 10,
b = 10)),
plot.margin = margin(t = 20,
r = 20,
b = 20,
l = 20)) +
coord_fixed(ratio = 5 / 5)
This final plot was difficult. Creating the data frame wasn’t too hard, but it wasn’t until I found the tile plot that I was able to visualize the data effectively. The original goal was to list all countries in the Y axis, but it never looked clean and attempts to manage row heights didn’t work as expected. To make the plot more meaningful, filtering the data to the top 10 countries with the largest number of movies seemed the cleanest. It uses the combination data set to tell an easy to interpret story. US, UK and Germany all have produced top 200 movies of every genre since 2000. India is in the top 5 of movies in the top 200’s in all but 5 Genres. Only 4 countries, in the top 10, have produced Westerns that made it to the top 200.
This is a useful look at this data. In further analysis including a companion version that looks at the top 10 avg money makers would also be interesting.
Learned some things about my coding preferences. I like multi-part statements, ie functions separated by commas to each be on their own line. This helps me think about each part of the function so I can be aware of which part might be missed. This goes along also with my commenting preference. By separating each statement or function with commas it is easier to add right side comments to address anything individual lines that were troublesome or interesting. When the lines get long it’s harder for me to read and see what belongs where. I guess, I don’t like scrolling to the right.
Also, I learned my commenting lessened the deeper I got. As I understand the basics more I am not feeling the necessity to comment each line, just the main actions, or items that are confusing.