Week 4 Data Dive

Tasks

The purpose of this week’s data dive is for you to think critically about what might go wrong when it comes time to draw conclusions from your data.

Your RMarkdown notebook for this data dive should contain the following:

  • A collection of 5 random samples of your data (with replacement). We are simulating the act of collecting data (like yours) from a population, where the simulated “population” is represented by the data set you already have.

    • Each subsample should be as long as roughly 50% of your data.

    • Store each sample set in a separate data frame (e.g., df_2 might be the second of these samples).

    • Of course, these subsamples should each include both categorical and continuous (numeric) data.

  • Scrutinize these subsamples. Note: you might find group_by quite helpful here!

    • How different are they?

    • What would you have called an anomaly in one sub-sample that you wouldn’t in another?

    • Are there aspects of the data that are consistent among all sub-samples?

  • Consider how this investigation affects how you might draw conclusions about the data in the future.

    • (optional) Try to incorporate Monte Carlo simulations here (see Lab 3) if you can …

For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions you have that might warrant further investigation.


Step 1 - Load the Libraries

library(tidyverse) #load the tidyverse library
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(scales)    #load the scales package for use with currency formatting
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
library(readr)     #readr is already attached by tidyverse; loaded explicitly here for parse_number()

Step 2 - Load the data set

t_box_office <- read_csv("C:/Users/danjh/Grad School/H510 Stats for DS/Datasets/box_office_data_2000_24_adj.csv")
## Rows: 5000 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): Release Group, Genres, Rating, Original_Language, Production_Count...
## dbl (10): Rank, $Worldwide, $Domestic, Domestic %, $Foreign, Foreign %, Year...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(t_box_office)[2] <- "Movie_Name"     #change the ambiguous name "Release Group" to the more explicit "Movie_Name"
head(t_box_office)
## # A tibble: 6 × 17
##    Rank Movie_Name  `$Worldwide` `$Domestic` `Domestic %` `$Foreign` `Foreign %`
##   <dbl> <chr>              <dbl>       <dbl>        <dbl>      <dbl>       <dbl>
## 1     1 Mission: I…    546388108   215409889         39.4  330978219        60.6
## 2     2 Gladiator      460583960   187705427         40.8  272878533        59.2
## 3     3 Cast Away      429632142   233632142         54.4  196000000        45.6
## 4     4 What Women…    374111707   182811707         48.9  191300000        51.1
## 5     5 Dinosaur       349822765   137748063         39.4  212074702        60.6
## 6     6 How the Gr…    345842198   260745620         75.4   85096578        24.6
## # ℹ 10 more variables: Year <dbl>, Genres <chr>, Rating <chr>,
## #   Vote_Count <dbl>, Original_Language <chr>, Production_Countries <chr>,
## #   Prime_Genre <chr>, Prime_Production_Country <chr>, Rating_scale <dbl>,
## #   Rating_of_10 <dbl>

Data Load - Insights

During initial data manipulation I found the default formatting of the currency columns distracting and difficult to follow, so I decided that cleaning this up at the load stage would make the data easier to follow and digest. Additionally, the movie title column is named “Release Group” in the original data set, and the documentation does not make it clear why that name was chosen. For clarity, I renamed that column to Movie_Name for the analysis tasks that follow.
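
Since the scales package is loaded above for currency formatting, scales::dollar() illustrates the clean-up idea. A minimal sketch, using two figures from the head() preview above as input:

#format raw box-office values as currency strings for easier reading
dollar(c(546388108, 460583960))   #expect "$546,388,108" "$460,583,960"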

I also had to make some adjustments to the data: when I first tried to run summaries, the columns labeled with special characters were read as text by default. The parse_number() function from the readr package solves this problem.
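
A minimal sketch of that fix; the character strings below are hypothetical stand-ins for how a currency column looks when it is read as text:

raw_gross <- c("$546,388,108", "$460,583,960")   #hypothetical text values
parse_number(raw_gross)   #drops the "$" and "," and returns numeric values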

Now that I have more familiarity with the data I can start posing some questions that I’d like to ask of the data at some point.

  • Is there any correlation between the score of a movie and its gross earnings? How about by genre or production country? (A first-pass check is sketched after this list.)

  • Can we establish a trust factor for the score based on the number of votes behind each score?

  • Would it also be clearer to rename the columns referred to as “rating” to “score” or something similar, since “rating” has a different connotation for movies?

  • Does the primary production country suggest anything about a movie’s earnings success?

  • I also noticed a significant gap in the data: it does not include the movies’ budgets. This might introduce bias, because success here can only be measured by gross earnings and ratings, whereas success is typically also judged by earnings relative to budget.
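
As referenced in the first question above, cor() with pairwise-complete observations (to tolerate the NA ratings noted later) is one way to start; a sketch, not a full analysis:

#rough first look at score vs. worldwide gross across the full data set
cor(t_box_office$Rating_of_10, t_box_office$`$Worldwide`,
    use = "pairwise.complete.obs")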

Task Walkthroughs

Task 1 - Create 5 random samples of data

#Create a subset of the data with just the numeric data
t_bo_subset <- t_box_office[, c("$Worldwide", "$Domestic", "$Foreign", "Vote_Count", "Rating_of_10")]   

head(t_bo_subset)
## # A tibble: 6 × 5
##   `$Worldwide` `$Domestic` `$Foreign` Vote_Count Rating_of_10
##          <dbl>       <dbl>      <dbl>      <dbl>        <dbl>
## 1    546388108   215409889  330978219       6741         6.13
## 2    460583960   187705427  272878533      19032         8.22
## 3    429632142   233632142  196000000      11403         7.66
## 4    374111707   182811707  191300000       3944         6.45
## 5    349822765   137748063  212074702       2530         6.54
## 6    345842198   260745620   85096578       7591         6.8
#create the 5 sample sets, each ~50% the length of the data, drawn with
#replacement (a more compact alternative is sketched after the previews below)
sample_size <- nrow(t_bo_subset) / 2  #number of rows per sample: half the data
sample_01 <- t_bo_subset[sample(nrow(t_bo_subset), sample_size, replace = TRUE), ]
sample_02 <- t_bo_subset[sample(nrow(t_bo_subset), sample_size, replace = TRUE), ]
sample_03 <- t_bo_subset[sample(nrow(t_bo_subset), sample_size, replace = TRUE), ]
sample_04 <- t_bo_subset[sample(nrow(t_bo_subset), sample_size, replace = TRUE), ]
sample_05 <- t_bo_subset[sample(nrow(t_bo_subset), sample_size, replace = TRUE), ]
#validate that the samples are different
head(sample_01)
## # A tibble: 6 × 5
##   `$Worldwide` `$Domestic` `$Foreign` Vote_Count Rating_of_10
##          <dbl>       <dbl>      <dbl>      <dbl>        <dbl>
## 1     33821338       25994   33795344          0         0   
## 2     85512300    32000304   53511996       5038         6.2 
## 3     50444358    40713082    9731276       1070         7.63
## 4      7361414     5600000    1761414         30         7   
## 5     16726510      721439   16005071        117         6.47
## 6     15439299           0   15439299        655         6.44
head(sample_02)
## # A tibble: 6 × 5
##   `$Worldwide` `$Domestic` `$Foreign` Vote_Count Rating_of_10
##          <dbl>       <dbl>      <dbl>      <dbl>        <dbl>
## 1     49830607    27779426   22051181       1884         6.23
## 2    193967670    99967670   94000000       1948         5.86
## 3    187281115   110359362   76921753       2939         5.91
## 4     20226058     4050103   16175955        723         7.45
## 5     34963395           0   34963395        100         7.16
## 6     20648328    10330853   10317475       2370         6.57
head(sample_03)
## # A tibble: 6 × 5
##   `$Worldwide` `$Domestic` `$Foreign` Vote_Count Rating_of_10
##          <dbl>       <dbl>      <dbl>      <dbl>        <dbl>
## 1    176070171   106128601   69941570       5517         6.8 
## 2    587204668   187168425  400036243       7067         6.92
## 3    100572044    74541707   26030337        779         6.4 
## 4     32001580           0   32001580          5         5.2 
## 5     49619118           0   49619118       2965         7.74
## 6     25948636           0   25948636          7         5
head(sample_04)
## # A tibble: 6 × 5
##   `$Worldwide` `$Domestic` `$Foreign` Vote_Count Rating_of_10
##          <dbl>       <dbl>      <dbl>      <dbl>        <dbl>
## 1     29144802    19765868    9378934        341         6.1 
## 2     55611001    44875481   10735520       1492         5.56
## 3    250425512    75030163  175395349       3248         5.11
## 4    215905815    58568815  157337000       3836         7.5 
## 5    181732879           0  181732879         41         3.9 
## 6     25030264           0   25030264         96         6.4
head(sample_05)
## # A tibble: 6 × 5
##   `$Worldwide` `$Domestic` `$Foreign` Vote_Count Rating_of_10
##          <dbl>       <dbl>      <dbl>      <dbl>        <dbl>
## 1     81244605    24149393   57095212       2620         6.63
## 2     10308627           0   10308627         17         6.3 
## 3    487637474     1721446  485916028         78         6.51
## 4    176506819    66486205  110020614      10167         6.89
## 5    125427681    73921000   51506681       1211         6.22
## 6     30932534           0   30932534        183         6.3
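
As an aside, the five assignments above can also be generated in one step. The assignment asks for separate data frames, so this is only a sketch of a more compact alternative that stores the samples in a named list:

#build all five samples at once; adding a sixth means changing one number
samples <- lapply(1:5, function(i) {
  t_bo_subset[sample(nrow(t_bo_subset), sample_size, replace = TRUE), ]
})
names(samples) <- paste0("sample_0", 1:5)
head(samples$sample_01)   #same kind of preview as above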

Task 2 - Scrutinize the sub-samples

Run summaries for the $value columns

summary(sample_01[c("$Worldwide", "$Foreign", "$Domestic")])   #Examining $ summary data for this sample
##    $Worldwide           $Foreign           $Domestic        
##  Min.   :1.699e+06   Min.   :0.000e+00   Min.   :        0  
##  1st Qu.:2.402e+07   1st Qu.:1.230e+07   1st Qu.:    10854  
##  Median :4.926e+07   Median :3.054e+07   Median : 17576612  
##  Mean   :1.223e+08   Mean   :7.718e+07   Mean   : 45141398  
##  3rd Qu.:1.240e+08   3rd Qu.:7.525e+07   3rd Qu.: 54851604  
##  Max.   :2.744e+09   Max.   :1.994e+09   Max.   :936662225
summary(sample_02[c("$Worldwide", "$Foreign", "$Domestic")])   #Examining $ summary data for this sample
##    $Worldwide           $Foreign           $Domestic        
##  Min.   :1.666e+06   Min.   :0.000e+00   Min.   :        0  
##  1st Qu.:2.393e+07   1st Qu.:1.350e+07   1st Qu.:    74054  
##  Median :4.920e+07   Median :3.033e+07   Median : 17613046  
##  Mean   :1.177e+08   Mean   :7.385e+07   Mean   : 43856805  
##  3rd Qu.:1.201e+08   3rd Qu.:7.214e+07   3rd Qu.: 53563168  
##  Max.   :2.799e+09   Max.   :1.941e+09   Max.   :858373000
summary(sample_03[c("$Worldwide", "$Foreign", "$Domestic")])   #Examining $ summary data for this sample
##    $Worldwide           $Foreign           $Domestic        
##  Min.   :1.666e+06   Min.   :0.000e+00   Min.   :        0  
##  1st Qu.:2.483e+07   1st Qu.:1.353e+07   1st Qu.:    21426  
##  Median :4.811e+07   Median :2.978e+07   Median : 18244060  
##  Mean   :1.161e+08   Mean   :7.198e+07   Mean   : 44118316  
##  3rd Qu.:1.125e+08   3rd Qu.:6.420e+07   3rd Qu.: 52557126  
##  Max.   :2.744e+09   Max.   :1.994e+09   Max.   :749766139
summary(sample_04[c("$Worldwide", "$Foreign", "$Domestic")])   #Examining $ summary data for this sample
##    $Worldwide           $Foreign           $Domestic        
##  Min.   :1.699e+06   Min.   :0.000e+00   Min.   :        0  
##  1st Qu.:2.522e+07   1st Qu.:1.372e+07   1st Qu.:   131058  
##  Median :4.872e+07   Median :3.046e+07   Median : 18336476  
##  Mean   :1.199e+08   Mean   :7.518e+07   Mean   : 44741012  
##  3rd Qu.:1.189e+08   3rd Qu.:7.400e+07   3rd Qu.: 54361307  
##  Max.   :2.799e+09   Max.   :1.994e+09   Max.   :858373000
summary(sample_05[c("$Worldwide", "$Foreign", "$Domestic")])   #Examining $ summary data for this sample
##    $Worldwide           $Foreign           $Domestic        
##  Min.   :1.666e+06   Min.   :0.000e+00   Min.   :        0  
##  1st Qu.:2.410e+07   1st Qu.:1.329e+07   1st Qu.:   203724  
##  Median :4.859e+07   Median :2.910e+07   Median : 18346039  
##  Mean   :1.247e+08   Mean   :7.831e+07   Mean   : 46423473  
##  3rd Qu.:1.159e+08   3rd Qu.:7.036e+07   3rd Qu.: 54731865  
##  Max.   :2.744e+09   Max.   :1.994e+09   Max.   :936662225
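
Reading five summary() blocks side by side is tedious. One way to line up a single statistic across the samples is a small named vector; a sketch using the sample data frames above:

#collect the $Worldwide mean from each sample for a direct comparison
worldwide_means <- c(s01 = mean(sample_01$`$Worldwide`),
                     s02 = mean(sample_02$`$Worldwide`),
                     s03 = mean(sample_03$`$Worldwide`),
                     s04 = mean(sample_04$`$Worldwide`),
                     s05 = mean(sample_05$`$Worldwide`))
round(worldwide_means / 1e6, 1)   #means expressed in millions of dollars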

Run summaries for the other numeric columns

summary(sample_01[c("Vote_Count", "Rating_of_10")])      #Examining vote_counts and Ratings for this sample
##    Vote_Count     Rating_of_10  
##  Min.   :    0   Min.   :0.000  
##  1st Qu.:  192   1st Qu.:6.000  
##  Median : 1010   Median :6.524  
##  Mean   : 2545   Mean   :6.486  
##  3rd Qu.: 3126   3rd Qu.:7.100  
##  Max.   :36753   Max.   :9.700  
##  NA's   :92      NA's   :92
summary(sample_02[c("Vote_Count", "Rating_of_10")])      #Examining vote_counts and Ratings for this sample
##    Vote_Count     Rating_of_10  
##  Min.   :    0   Min.   :0.000  
##  1st Qu.:  215   1st Qu.:6.000  
##  Median : 1068   Median :6.582  
##  Mean   : 2433   Mean   :6.463  
##  3rd Qu.: 3097   3rd Qu.:7.100  
##  Max.   :35934   Max.   :9.000  
##  NA's   :88      NA's   :88
summary(sample_03[c("Vote_Count", "Rating_of_10")])      #Examining vote_counts and Ratings for this sample
##    Vote_Count     Rating_of_10  
##  Min.   :    0   Min.   :0.000  
##  1st Qu.:  203   1st Qu.:6.000  
##  Median : 1036   Median :6.600  
##  Mean   : 2531   Mean   :6.487  
##  3rd Qu.: 3046   3rd Qu.:7.136  
##  Max.   :33108   Max.   :9.000  
##  NA's   :87      NA's   :87
summary(sample_04[c("Vote_Count", "Rating_of_10")])      #Examining vote_counts and Ratings for this sample
##    Vote_Count       Rating_of_10  
##  Min.   :    0.0   Min.   :0.000  
##  1st Qu.:  212.8   1st Qu.:6.024  
##  Median : 1023.5   Median :6.599  
##  Mean   : 2550.8   Mean   :6.517  
##  3rd Qu.: 2887.0   3rd Qu.:7.152  
##  Max.   :35934.0   Max.   :9.000  
##  NA's   :84        NA's   :84
summary(sample_05[c("Vote_Count", "Rating_of_10")])      #Examining vote_counts and Ratings for this sample
##    Vote_Count       Rating_of_10  
##  Min.   :    0.0   Min.   :0.000  
##  1st Qu.:  222.5   1st Qu.:6.000  
##  Median : 1055.0   Median :6.587  
##  Mean   : 2695.7   Mean   :6.477  
##  3rd Qu.: 3111.0   3rd Qu.:7.100  
##  Max.   :35934.0   Max.   :8.600  
##  NA's   :92        NA's   :92
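
Because each sample carries NA values in these two columns (84 to 92 of them here), any statistic computed directly on them needs na.rm = TRUE; a minimal sketch:

#summary() reports NAs automatically, but direct computations must drop them
mean(sample_01$Rating_of_10)                #returns NA because of the missing values
mean(sample_01$Rating_of_10, na.rm = TRUE)  #mean of the non-missing ratings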

Questions about the Sub-Samples

How different are they?

All five sample sets are different, but their summary statistics fall in the same general ranges.

Looking at $Worldwide minimum and maximum data as a sample:

  • The minimum ranges from 1.666e+06 to 1.699e+06, a difference of roughly 33K.

  • The maximum ranges from 2.744e+09 to 2.799e+09, a difference of roughly 55M (see the sketch below).
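
As referenced above, these extremes can be pulled out programmatically rather than read off the summary blocks; a sketch:

#compare the $Worldwide maxima across the five samples
maxima <- c(max(sample_01$`$Worldwide`), max(sample_02$`$Worldwide`),
            max(sample_03$`$Worldwide`), max(sample_04$`$Worldwide`),
            max(sample_05$`$Worldwide`))
range(maxima)   #smallest and largest of the five sample maxima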

What would you have called an anomaly in one sub-sample that you wouldn’t in another?

The maximum $Domestic value is noticeably lower in sample_03 ($749,766,139) than in the other samples, where it reaches at least $858,373,000. The largest domestic earners simply weren’t drawn into that sample.

Are there aspects of the data that are consistent among all sub-samples?

The minimum rating was consistently 0, as was the minimum number of votes. This suggests the data set is large enough to contain a meaningful number of zero-vote, zero-rating movies. The NA counts tell a similar story, ranging from 84 to 92 across the samples.
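
To quantify this sampling variability more directly (the optional Monte Carlo idea from the task list), the resampling can be repeated many times while tracking a single statistic. A minimal sketch, with a hypothetical seed for reproducibility:

set.seed(42)   #hypothetical seed so the simulation can be re-run identically
#repeat the 50%-with-replacement sampling 1000 times, tracking the mean
mc_means <- replicate(1000, {
  idx <- sample(nrow(t_bo_subset), nrow(t_bo_subset) / 2, replace = TRUE)
  mean(t_bo_subset$`$Worldwide`[idx])
})
summary(mc_means)   #how much the mean wanders from sample to sample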