At least 3 “group by” data frames, and an investigation into
each. You’ll need to use categorical columns, or one of
the cut_
functions here
Use the group_by
function to group your data into
(at least) 3 different sets of groups, each summarizing different
variables.
For example, this could be as simple as three data frames which group your data based on three different categorical columns, but summarize the same continuous column. Or, it could be as complex as three different combinations of categorical columns, each illustrating summarizations of different continuous (or categorical columns).
Each group in a group_by
dataframe will have a
number of rows associated with it (e.g., if you only group by a single
column, then this is the result of count
). So, if we were
to randomly select a row from your dataset, the smallest groups have a
lowest probability of being selected.
Assign the lowest probability group(s) a special tag, and then translate that back into the context of your data. Draw some conclusions about the groups you’ve found. (I.e., in other words, what does it mean that Group X is the smallest? Can you phrase this in terms of probability?)
Draw a testable hypothesis for why some groups are rarer than others (i.e., something quantifiable).
Build at least one visualization for each of these three groupings.
Pick two categorical variables, and build a data frame of all their combinations (i.e., the unique rows among the two columns).
count
or group_by
.*explain what insight was gained.
library(tidyverse) #load the tidyverse library}
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
*Using an adjusted version of the enhanced_box_office_data(2000-2024)u.csv which has the following column changes to support cleaner coding and analyses.
prime_genre
was added to show the primary genre
selected for each film. The initial value in the comma separated data
was defaulted as the primary genre. Data review shows genres were not
always the same place in the order suggesting a hierarchy.
prime_production_company
was added to show the
primary production country for each film. It is unclear through the data
if there is any precedence as countries are most frequently listed
alphabetically. For the purpose of this review the intial value in the
comma separated data was selected as the primary.
rating_of_10
was added to represent the rating in a
format that allows further calculation. The original format of x.xx/10
was converted to x.xx after all values were confirmed to be on the same
scale of 10
rating_scale
was added to confirm that all ratings
are on the same scale. Review shows all values are either 10 or null
confirming that all non null instances of rating_of_10
can
be measured against each other as they match the same scale.
*All further data dives will use the same adjusted data set and will not explicitly mention these adjustments. These adjustments will be called out in the final project for publication.
t_box_office <- read_delim("C:/Users/danjh/Grad School/H510 Stats for DS/Datasets/box_office_data_2000_24_adj.csv", delim = ",")
## Rows: 5000 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Release Group, Genres, Rating, Original_Language, Production_Count...
## dbl (10): Rank, $Worldwide, $Domestic, Domestic %, $Foreign, Foreign %, Year...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
view(t_box_office)
At least 3 “group by” data frames, and an investigation into each.
You’ll need to use categorical columns, or one of
the cut_
functions here
Group By 1 - What is the average revenue by each genre?
Worldwide_rev_by_genre <- t_box_office |>
group_by(Prime_Genre) |>
summarise(avg_ww_revenue = mean(`$Worldwide`))
#Worldwide_rev_by_genre
Group By 2 - What is the average revenue through foreign distribution by Production Country?
foreign_rev_pcnt_by_PCountry <- t_box_office |>
group_by(Prime_Production_Country) |>
summarise(avg_foreign_pcnt = mean(`Foreign %`, na.rm = TRUE)) |>
arrange((avg_foreign_pcnt))
#foreign_rev_pcnt_by_PCountry
Group By 3 -
avg_rev <- t_box_office |>
na.omit() |>
group_by(yr_group = cut(Year, breaks = c(2000, 2005, 2010, 2015, 2020, 2025), include.lowest = TRUE))|>
summarise(mean_revenue = mean(`$Worldwide`))
#avg_rev
Pick two categorical variables, and build a data frame of all their combinations (i.e., the unique rows among the two columns).
dstnct_genre <- unique(t_box_office$Prime_Genre)
dstnct_Country <- unique(t_box_office$Prime_Production_Country)
df <- expand.grid(dstnct_Country,dstnct_genre)
#df
t_box_office |>
group_by(Prime_Genre, Prime_Production_Country) |>
summarise(count = n(), .groups = "keep") |>
arrange((count))
## # A tibble: 374 × 3
## # Groups: Prime_Genre, Prime_Production_Country [374]
## Prime_Genre Prime_Production_Country count
## <chr> <chr> <int>
## 1 Action Belgium 1
## 2 Action Cambodia 1
## 3 Action Finland 1
## 4 Action Indonesia 1
## 5 Action Luxembourg 1
## 6 Action Morocco 1
## 7 Action New Zealand 1
## 8 Action Nigeria 1
## 9 Action Poland 1
## 10 Action South Africa 1
## # ℹ 364 more rows
genre_country_counts <- t_box_office |>
group_by(Prime_Genre, Prime_Production_Country) |>
summarise(count = n(), .groups = "keep") |>
arrange(desc(count))
#genre_country_counts
countries_without_westerns <- t_box_office |>
distinct(Prime_Production_Country) |>
filter(!Prime_Production_Country %in% t_box_office)
# filter(Prime_Genre == "Western") |>
# distinct(Prime_Production_Country))
#countries_without_westerns