Week 4 Bechdel Test - Data Dive

#loading the data and needed libraries into the markdown file
library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)

bechdel_data_raw <- read_csv("C:/Users/Lauren/Documents/Stats Data/raw_bechdel.csv")
## Rows: 8839 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): imdb_id, title
## dbl (3): year, id, rating
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bechdel_data_movies <- read_csv("C:/Users/Lauren/Documents/Stats Data/movies.csv")
## Rows: 1794 Columns: 34
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (24): imdb, title, test, clean_test, binary, domgross, intgross, code, d...
## dbl  (7): year, budget, budget_2013, period_code, decade_code, metascore, im...
## num  (1): imdb_votes
## lgl  (2): response, error
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#helper code
# change this number, and consider how it affects the sub-sample analysis
sample_frac = 0.25

# number of samples to scrutinize
n_samples = 3

df_samples = tibble()  # empty dataframe to append to

for (sample_i in 1:n_samples) {
  df_i <- bechdel_data_movies |>
    sample_n(size = sample_frac * nrow(bechdel_data_movies), replace = TRUE) |>
    mutate(sample_num = sample_i)  # add a column indicating sample number
  
  df_samples = bind_rows(df_samples, df_i)
}
#Checking for sample tibble validity. 
df_samples
## # A tibble: 1,344 × 35
##     year imdb      title  test  clean_test binary budget domgross intgross code 
##    <dbl> <chr>     <chr>  <chr> <chr>      <chr>   <dbl> <chr>    <chr>    <chr>
##  1  2012 tt1735898 Snow … ok    ok         PASS   1.7 e8 1551367… 4009112… 2012…
##  2  2013 tt2024544 12 Ye… nota… notalk     FAIL   2   e7 53107035 1586070… 2013…
##  3  2004 tt0327679 Ella … ok    ok         PASS   3.5 e7 22913677 22913677 2004…
##  4  2006 tt0454848 Insid… dubi… dubious    FAIL   5   e7 88634237 1846342… 2006…
##  5  2004 tt0299172 Home … ok    ok         PASS   1.10e8 50026353 76482461 2004…
##  6  2009 tt0795351 Case … ok    ok         PASS   2.70e7 13261851 29302804 2009…
##  7  2004 tt0293508 The P… men-… men        FAIL   5.50e7 51225796 1582257… 2004…
##  8  2007 tt0861739 Tropa… ok-d… ok         PASS   6.54e6 8744     14319195 2007…
##  9  2008 tt0981227 Nick … ok    ok         PASS   1   e7 31487293 33544594 2008…
## 10  2000 tt0120913 Titan… ok    ok         PASS   7.5 e7 22751979 36751979 2000…
## # ℹ 1,334 more rows
## # ℹ 25 more variables: budget_2013 <dbl>, domgross_2013 <chr>,
## #   intgross_2013 <chr>, period_code <dbl>, decade_code <dbl>, imdb_id <chr>,
## #   plot <chr>, rated <chr>, response <lgl>, language <chr>, country <chr>,
## #   writer <chr>, metascore <dbl>, imdb_rating <dbl>, director <chr>,
## #   released <chr>, actors <chr>, genre <chr>, awards <chr>, runtime <chr>,
## #   type <chr>, poster <chr>, imdb_votes <dbl>, error <lgl>, sample_num <int>
#separate the samples into their own dataframes :)
samples_1 <- filter(df_samples, sample_num == 1)
samples_1
## # A tibble: 448 × 35
##     year imdb      title  test  clean_test binary budget domgross intgross code 
##    <dbl> <chr>     <chr>  <chr> <chr>      <chr>   <dbl> <chr>    <chr>    <chr>
##  1  2012 tt1735898 Snow … ok    ok         PASS   1.7 e8 1551367… 4009112… 2012…
##  2  2013 tt2024544 12 Ye… nota… notalk     FAIL   2   e7 53107035 1586070… 2013…
##  3  2004 tt0327679 Ella … ok    ok         PASS   3.5 e7 22913677 22913677 2004…
##  4  2006 tt0454848 Insid… dubi… dubious    FAIL   5   e7 88634237 1846342… 2006…
##  5  2004 tt0299172 Home … ok    ok         PASS   1.10e8 50026353 76482461 2004…
##  6  2009 tt0795351 Case … ok    ok         PASS   2.70e7 13261851 29302804 2009…
##  7  2004 tt0293508 The P… men-… men        FAIL   5.50e7 51225796 1582257… 2004…
##  8  2007 tt0861739 Tropa… ok-d… ok         PASS   6.54e6 8744     14319195 2007…
##  9  2008 tt0981227 Nick … ok    ok         PASS   1   e7 31487293 33544594 2008…
## 10  2000 tt0120913 Titan… ok    ok         PASS   7.5 e7 22751979 36751979 2000…
## # ℹ 438 more rows
## # ℹ 25 more variables: budget_2013 <dbl>, domgross_2013 <chr>,
## #   intgross_2013 <chr>, period_code <dbl>, decade_code <dbl>, imdb_id <chr>,
## #   plot <chr>, rated <chr>, response <lgl>, language <chr>, country <chr>,
## #   writer <chr>, metascore <dbl>, imdb_rating <dbl>, director <chr>,
## #   released <chr>, actors <chr>, genre <chr>, awards <chr>, runtime <chr>,
## #   type <chr>, poster <chr>, imdb_votes <dbl>, error <lgl>, sample_num <int>
samples_2 <- filter(df_samples, sample_num == 2)
samples_2
## # A tibble: 448 × 35
##     year imdb      title  test  clean_test binary budget domgross intgross code 
##    <dbl> <chr>     <chr>  <chr> <chr>      <chr>   <dbl> <chr>    <chr>    <chr>
##  1  2001 tt0236348 Josie… ok    ok         PASS   2.20e7 14252830 14252830 2001…
##  2  2010 tt1001526 Megam… nowo… nowomen    FAIL   1.3 e8 1484158… 3218872… 2010…
##  3  2005 tt0450278 Hostel nota… notalk     FAIL   4.80e6 47326473 82241110 2005…
##  4  2006 tt0429589 The A… men   men        FAIL   4.50e7 28142535 49610898 2006…
##  5  2013 tt0790736 R.I.P… nota… notalk     FAIL   1.3 e8 33618855 79019947 2013…
##  6  2011 tt1637688 In Ti… nota… notalk     FAIL   3.5 e7 37553932 1700646… 2011…
##  7  1991 tt0102510 The N… nota… notalk     FAIL   2.3 e7 86930411 86930411 1991…
##  8  2006 tt0405094 Das L… nowo… nowomen    FAIL   2   e6 11284657 81197047 2006…
##  9  2010 tt0938283 The L… nota… notalk     FAIL   1.5 e8 1317721… 3197138… 2010…
## 10  2003 tt0282209 Darkn… nota… notalk     FAIL   7   e6 32539681 32539681 2003…
## # ℹ 438 more rows
## # ℹ 25 more variables: budget_2013 <dbl>, domgross_2013 <chr>,
## #   intgross_2013 <chr>, period_code <dbl>, decade_code <dbl>, imdb_id <chr>,
## #   plot <chr>, rated <chr>, response <lgl>, language <chr>, country <chr>,
## #   writer <chr>, metascore <dbl>, imdb_rating <dbl>, director <chr>,
## #   released <chr>, actors <chr>, genre <chr>, awards <chr>, runtime <chr>,
## #   type <chr>, poster <chr>, imdb_votes <dbl>, error <lgl>, sample_num <int>
samples_3 <- filter(df_samples, sample_num == 3)
samples_3
## # A tibble: 448 × 35
##     year imdb      title  test  clean_test binary budget domgross intgross code 
##    <dbl> <chr>     <chr>  <chr> <chr>      <chr>   <dbl> <chr>    <chr>    <chr>
##  1  2005 tt0416320 Match… ok    ok         PASS   1.5 e7 23089926 87989926 2005…
##  2  2010 tt0892318 Lette… dubi… dubious    FAIL   3   e7 53032453 79135982 2010…
##  3  2011 tt1093357 The D… ok    ok         PASS   3.48e7 21443494 62831715 2011…
##  4  2000 tt0212338 Meet … men   men        FAIL   5.50e7 1662250… 3045998… 2000…
##  5  2004 tt0368447 The V… ok    ok         PASS   7.17e7 1141975… 2576416… 2004…
##  6  2013 tt2184339 The P… dubi… dubious    FAIL   3   e6 64473115 91100541 2013…
##  7  2012 tt1646987 Wrath… nota… notalk     FAIL   1.5 e8 83670083 3019700… 2012…
##  8  1995 tt0112697 Cluel… ok    ok         PASS   1.37e7 56598476 56598476 1995…
##  9  2001 tt0203009 Mouli… ok    ok         PASS   5.3 e7 57386369 1792131… 2001…
## 10  1996 tt0115759 Broke… nowo… nowomen    FAIL   6.5 e7 70645997 1483459… 1996…
## # ℹ 438 more rows
## # ℹ 25 more variables: budget_2013 <dbl>, domgross_2013 <chr>,
## #   intgross_2013 <chr>, period_code <dbl>, decade_code <dbl>, imdb_id <chr>,
## #   plot <chr>, rated <chr>, response <lgl>, language <chr>, country <chr>,
## #   writer <chr>, metascore <dbl>, imdb_rating <dbl>, director <chr>,
## #   released <chr>, actors <chr>, genre <chr>, awards <chr>, runtime <chr>,
## #   type <chr>, poster <chr>, imdb_votes <dbl>, error <lgl>, sample_num <int>
#some data set benchmarks of the whole data set
# Overall pass/fail rate
table_pass_fail <- table(bechdel_data_movies$binary)
table_pass_fail
## 
## FAIL PASS 
##  991  803
barplot(table_pass_fail)

# Overall IMDB Rating
IMDB_rating_overall <- bechdel_data_movies |>
  summarise( mean_IMDB = mean(imdb_rating, na.rm = TRUE), median_IMDB = median(imdb_rating, na.rm = TRUE) )

IMDB_rating_overall
## # A tibble: 1 × 2
##   mean_IMDB median_IMDB
##       <dbl>       <dbl>
## 1      6.76         6.8
#Overall Budget Adjusted to 2013 USD

budget_overall <- bechdel_data_movies |>
  summarise(mean_budget_2013 = mean(budget_2013, na.rm = TRUE), median_budget_2013 = median(budget_2013, na.rm = TRUE))

budget_overall
## # A tibble: 1 × 2
##   mean_budget_2013 median_budget_2013
##              <dbl>              <dbl>
## 1        55464608.           36995786
#Group By Rated and show count

common_ratings_bar <- ggplot(bechdel_data_movies, aes(x=rated)) + geom_bar()
common_ratings_bar

#testing out sample pass fail rates
table_pass_fail_sample1 <- table(samples_1$binary)
table_pass_fail_sample1
## 
## FAIL PASS 
##  238  210
barplot(table_pass_fail_sample1)

table_pass_fail_sample2 <- table(samples_2$binary)
table_pass_fail_sample2
## 
## FAIL PASS 
##  257  191
barplot(table_pass_fail_sample2)

table_pass_fail_sample3 <- table(samples_3$binary)
table_pass_fail_sample3
## 
## FAIL PASS 
##  221  227
barplot(table_pass_fail_sample3)

Pass/Fail Rate

The overall pass rate (for the Bechdel test) of the whole data set is approximately 44.8% (803 of 1,794 movies). In the output shown above, the three samples have pass rates of about 46.9%, 42.6%, and 50.7%, so each lands within roughly six percentage points of the unsampled rate, and two of the three are within about two points. As the sample size increases, I would expect the pass rate to get closer and closer to the pass rate of the whole set. (This is an example of the law of large numbers!) Overall, the pass/fail rate is an aspect of the data that is relatively consistent between the sampled and unsampled data.
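
As a rough check on how much a 25% sample's pass rate should be expected to wander, the sketch below recomputes the rates from the counts in the tables printed above and compares each sample to the whole set in units of the standard error of a proportion. It assumes independent draws with replacement (which matches `replace = TRUE` in the helper code); the variable names are illustrative only.

```r
# Counts copied from the pass/fail tables printed above.
full_counts <- c(FAIL = 991, PASS = 803)
sample_pass <- c(sample1 = 210, sample2 = 191, sample3 = 227)
n_sample <- 448  # rows in each sample

# Pass rate of the whole data set and of each sample.
p_full <- full_counts[["PASS"]] / sum(full_counts)
p_samples <- sample_pass / n_sample

# Standard error of a sample proportion for n_sample independent draws.
se <- sqrt(p_full * (1 - p_full) / n_sample)

# Distance of each sample's pass rate from the full rate, in standard errors.
z <- (p_samples - p_full) / se

round(c(full = p_full, p_samples), 3)
round(z, 2)
```

Because the standard error shrinks like one over the square root of the sample size, raising `sample_frac` should tighten the spread of sample pass rates around the full-data rate, which is the law of large numbers at work.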

#Sample IMDB Ratings
IMDB_rating_1 <- samples_1 |>
  summarise( mean_IMDB1 = mean(imdb_rating, na.rm = TRUE), median_IMDB1 = median(imdb_rating, na.rm = TRUE) )
IMDB_rating_2 <- samples_2 |>
  summarise( mean_IMDB2 = mean(imdb_rating, na.rm = TRUE), median_IMDB2 = median(imdb_rating, na.rm = TRUE) )
IMDB_rating_3 <- samples_3 |>
  summarise( mean_IMDB3 = mean(imdb_rating, na.rm = TRUE), median_IMDB3 = median(imdb_rating, na.rm = TRUE) )
IMDB_rating_1
## # A tibble: 1 × 2
##   mean_IMDB1 median_IMDB1
##        <dbl>        <dbl>
## 1       6.84          6.9
IMDB_rating_2
## # A tibble: 1 × 2
##   mean_IMDB2 median_IMDB2
##        <dbl>        <dbl>
## 1       6.67          6.8
IMDB_rating_3
## # A tibble: 1 × 2
##   mean_IMDB3 median_IMDB3
##        <dbl>        <dbl>
## 1       6.66          6.8

IMDB Ratings

The IMDB ratings of the samples are also pretty similar to those of the unsampled data set, in both mean and median. In the output shown above, the mean ratings of the samples vary from 6.66 to 6.84 (unsampled mean is 6.76), while the medians are 6.8 or 6.9 (unsampled median is 6.8). This is another aspect that is consistent across the sampled and unsampled data sets.
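
The three near-identical summaries above could also be computed in one pass by grouping on the sample number. The sketch below is a minimal illustration on a small made-up data frame (`toy_samples` and its values are hypothetical, not the real data), written in base R so it runs without extra packages.

```r
# Hypothetical stand-in for df_samples: 3 samples of 5 ratings each.
toy_samples <- data.frame(
  sample_num  = rep(1:3, each = 5),
  imdb_rating = c(6.0, 7.0, 8.0, NA,  7.0,   # sample 1 (one missing value)
                  5.5, 6.5, 6.5, 7.5, 8.0,   # sample 2
                  6.0, 6.2, 6.8, 7.0, 9.0)   # sample 3
)

# Mean and median per sample; na.rm = TRUE skips missing ratings.
mean_by_sample <- tapply(toy_samples$imdb_rating, toy_samples$sample_num,
                         mean, na.rm = TRUE)
median_by_sample <- tapply(toy_samples$imdb_rating, toy_samples$sample_num,
                           median, na.rm = TRUE)

# Collect both statistics into one table, one row per sample.
rating_by_sample <- data.frame(
  sample_num  = as.integer(names(mean_by_sample)),
  mean_IMDB   = as.numeric(mean_by_sample),
  median_IMDB = as.numeric(median_by_sample)
)
rating_by_sample
```

Keeping all samples in one table makes it easier to compare them side by side, and the code does not need to change if `n_samples` is raised.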

#budgets for samples
budget_1 <- samples_1 |>
  summarise(mean_budget_2013_1 = mean(budget_2013, na.rm = TRUE), median_budget_2013_1 = median(budget_2013, na.rm = TRUE))

budget_2 <- samples_2 |>
  summarise(mean_budget_2013_2 = mean(budget_2013, na.rm = TRUE), median_budget_2013_2 = median(budget_2013, na.rm = TRUE))

budget_3 <- samples_3 |>
  summarise(mean_budget_2013_3 = mean(budget_2013, na.rm = TRUE), median_budget_2013_3 = median(budget_2013, na.rm = TRUE))

budget_1 
## # A tibble: 1 × 2
##   mean_budget_2013_1 median_budget_2013_1
##                <dbl>                <dbl>
## 1          54512970.             37157440
budget_2
## # A tibble: 1 × 2
##   mean_budget_2013_2 median_budget_2013_2
##                <dbl>                <dbl>
## 1          57543001.             38855376
budget_3
## # A tibble: 1 × 2
##   mean_budget_2013_3 median_budget_2013_3
##                <dbl>                <dbl>
## 1          53844695.             35721383

Budget (in 2013 USD)

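The mean and median budgets of the samples are close to one another and to the unsampled values, though they do not match exactly. In the output shown above, the sample means range from about $53.8 million to $57.5 million (unsampled mean is about $55.5 million) and the sample medians range from about $35.7 million to $38.9 million (unsampled median is about $37.0 million). Sample 2 sits above the unsampled data set on both measures, while sample 3 sits below it on both. When a sample comes out higher than the whole set, one plausible explanation is that a few very low budget movies in the whole data set were not drawn as heavily into that sample. Thus, I would consider finding a very small budget film in a sample to be more anomalous than finding that same film in the whole set.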
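
To get a feel for how often a 25% sample's mean budget lands above or below the full-data mean by chance alone, the sketch below repeats the sampling step many times. It uses a made-up right-skewed vector as a stand-in for `budget_2013` (the log-normal parameters are assumptions chosen only to give budget-like magnitudes), so the numbers it prints say nothing about the real data.

```r
# Hypothetical stand-in for budget_2013: right-skewed toy values, NOT the real data.
set.seed(42)
toy_budget <- rlnorm(1794, meanlog = 17.4, sdlog = 1)

full_mean <- mean(toy_budget)
n_draw <- floor(0.25 * length(toy_budget))  # 448 rows, as in the samples above

# Draw many samples with replacement and record each sample's mean budget.
sample_means <- replicate(
  2000,
  mean(sample(toy_budget, n_draw, replace = TRUE))
)

# Share of samples whose mean lands above the full-data mean.
share_above <- mean(sample_means > full_mean)
share_above

# Smallest and largest sample mean, relative to the full-data mean.
range(sample_means) / full_mean
```

If the share is near one half, a single sample landing above or below the full-data mean is not surprising on its own; a pattern across many samples would be stronger evidence.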

#Ratings spread in the samples
sample_ratings_bar1 <- ggplot(samples_1, aes(x=rated)) + geom_bar()
sample_ratings_bar2 <- ggplot(samples_2, aes(x=rated)) + geom_bar()
sample_ratings_bar3 <- ggplot(samples_3, aes(x=rated)) + geom_bar()
sample_ratings_bar1

sample_ratings_bar2

sample_ratings_bar3

Ratings

So, in the ratings we finally see a notable difference from the unsampled data set. Sample 1 lacks the “unrated” and X-rated movies that are lightly present in the unsampled data. Sample 3 lacks X-rated movies. All samples lack the TV-14 and TV-PG ratings found in the unsampled data set. I would consider both the unrated and X-rated films present in Sample 2 (or even within the whole data set) anomalous. As the samples get larger and/or more numerous, I would expect to eventually find all of the rating options. The proportions of the major categories (PG, PG-13, and R) as represented in the bar charts otherwise appear to be similar.
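
The missing categories can be listed directly, and the chance of missing a rare category can be worked out by hand. The sketch below uses small made-up vectors for the ratings and hypothetical counts `k` for how many films carry a rare rating (the real counts are not shown above); only the data set size and the sample size come from the output above.

```r
# Hypothetical rating vectors standing in for the `rated` column.
toy_full_rated   <- c("PG", "PG-13", "R", "R", "X", "TV-14")
toy_sample_rated <- c("R", "PG-13", "R")

# Ratings present in the whole set but absent from the sample.
missing_ratings <- setdiff(unique(toy_full_rated), unique(toy_sample_rated))
missing_ratings

# Chance that a rating carried by k films never appears in a sample of
# n_draw rows drawn with replacement from n_total rows.
n_total <- 1794
n_draw  <- 448
k <- c(1, 2, 5, 10)  # hypothetical counts of films with a rare rating
p_missing <- (1 - k / n_total)^n_draw

round(setNames(p_missing, paste0("k=", k)), 3)
```

Under these assumptions, a rating carried by only one or two films is more likely than not to be absent from any single 25% sample, which would fit with the rare ratings dropping out of some samples.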

Conclusion

This investigation has led me to be only slightly more skeptical of data sets. I already knew that the smaller a sample is in relation to the population it purports to represent, the more skewed it can be. After all, if your sample data set were a single row, it wouldn’t represent a lengthy unsampled data set very well. I was surprised to see that most of the sampled metrics were very similar to those of the unsampled data set. The