Week 3 Data Dive

Tasks:

  1. At least 3 “group by” data frames, and an investigation into each. You’ll need to use categorical columns, or one of the cut_ functions here

    • Use the group_by function to group your data into (at least) 3 different sets of groups, each summarizing different variables.

    • For example, this could be as simple as three data frames which group your data based on three different categorical columns, but summarize the same continuous column. Or, it could be as complex as three different combinations of categorical columns, each illustrating summarizations of different continuous (or categorical columns).

    • Each group in a group_by dataframe will have a number of rows associated with it (e.g., if you only group by a single column, then this is the result of count). So, if we were to randomly select a row from your dataset, the smallest groups have a lowest probability of being selected.

    • Assign the lowest probability group(s) a special tag, and then translate that back into the context of your data. Draw some conclusions about the groups you’ve found. (I.e., in other words, what does it mean that Group X is the smallest? Can you phrase this in terms of probability?)

    • Draw a testable hypothesis for why some groups are rarer than others (i.e., something quantifiable).

    • Build at least one visualization for each of these three groupings.

  2. Pick two categorical variables, and build a data frame of all their combinations (i.e., the unique rows among the two columns).

    • (Only if it exists…) Find a combination that does not exist in the data. (For example, if you have a column for “color” and a column for “size”, you might be missing a row that is both “Blue” and “Medium”.) Why do you think these are missing?
    • Which combinations are the most/least common, and why might that be? Hint: for this, you’ll need to use count or group_by.
    • Find a way to visualize at least one of the combinations.

*explain what insight was gained.

Step 1 - Load Libraries

library(tidyverse) #load the tidyverse library}
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Step 2 - Load data set

*Using an adjusted version of the enhanced_box_office_data(2000-2024)u.csv which has the following column changes to support cleaner coding and analyses.

  • prime_genre was added to show the primary genre selected for each film. The initial value in the comma separated data was defaulted as the primary genre. Data review shows genres were not always the same place in the order suggesting a hierarchy.

  • prime_production_company was added to show the primary production country for each film. It is unclear through the data if there is any precedence as countries are most frequently listed alphabetically. For the purpose of this review the intial value in the comma separated data was selected as the primary.

  • rating_of_10 was added to represent the rating in a format that allows further calculation. The original format of x.xx/10 was converted to x.xx after all values were confirmed to be on the same scale of 10

  • rating_scale was added to confirm that all ratings are on the same scale. Review shows all values are either 10 or null confirming that all non null instances of rating_of_10 can be measured against each other as they match the same scale.

*All further data dives will use the same adjusted data set and will not explicitly mention these adjustments. These adjustments will be called out in the final project for publication.

t_box_office <- read_delim("C:/Users/danjh/Grad School/H510 Stats for DS/Datasets/box_office_data_2000_24_adj.csv", delim = ",")
## Rows: 5000 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): Release Group, Genres, Rating, Original_Language, Production_Count...
## dbl (10): Rank, $Worldwide, $Domestic, Domestic %, $Foreign, Foreign %, Year...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
view(t_box_office)

Task demonstrations

Task 1 - Build 3 Group by data frames

At least 3 “group by” data frames, and an investigation into each. You’ll need to use categorical columns, or one of the cut_ functions here

  1. Group By 1 - What is the average revenue by each genre?

    Worldwide_rev_by_genre <- t_box_office |>                                          
      group_by(Prime_Genre) |>
      summarise(avg_ww_revenue = mean(`$Worldwide`))
    
    #Worldwide_rev_by_genre
  2. Group By 2 - What is the average revenue through foreign distribution by Production Country?

    foreign_rev_pcnt_by_PCountry <- t_box_office |>
      group_by(Prime_Production_Country) |>
      summarise(avg_foreign_pcnt = mean(`Foreign %`, na.rm = TRUE)) |>
      arrange((avg_foreign_pcnt))
    
    #foreign_rev_pcnt_by_PCountry
  3. Group By 3 -

    avg_rev <- t_box_office |>
      na.omit() |>
      group_by(yr_group = cut(Year, breaks = c(2000, 2005, 2010, 2015, 2020, 2025), include.lowest = TRUE))|>
      summarise(mean_revenue = mean(`$Worldwide`))
    
    #avg_rev

Task 2 - Build data frames of two categorical variables with all their combinations

Pick two categorical variables, and build a data frame of all their combinations (i.e., the unique rows among the two columns).

dstnct_genre <- unique(t_box_office$Prime_Genre)
dstnct_Country <- unique(t_box_office$Prime_Production_Country)
df <- expand.grid(dstnct_Country,dstnct_genre)
#df
t_box_office |>
  group_by(Prime_Genre, Prime_Production_Country) |>
  summarise(count = n(), .groups = "keep") |>
  arrange((count))
## # A tibble: 374 × 3
## # Groups:   Prime_Genre, Prime_Production_Country [374]
##    Prime_Genre Prime_Production_Country count
##    <chr>       <chr>                    <int>
##  1 Action      Belgium                      1
##  2 Action      Cambodia                     1
##  3 Action      Finland                      1
##  4 Action      Indonesia                    1
##  5 Action      Luxembourg                   1
##  6 Action      Morocco                      1
##  7 Action      New Zealand                  1
##  8 Action      Nigeria                      1
##  9 Action      Poland                       1
## 10 Action      South Africa                 1
## # ℹ 364 more rows
genre_country_counts <- t_box_office |>
  group_by(Prime_Genre, Prime_Production_Country) |>
  summarise(count = n(), .groups = "keep") |>
  arrange(desc(count))

#genre_country_counts
countries_without_westerns <- t_box_office |>
  distinct(Prime_Production_Country) |>
  filter(!Prime_Production_Country %in% t_box_office)
#  filter(Prime_Genre == "Western") |>
#  distinct(Prime_Production_Country))

#countries_without_westerns