The purpose of this week’s data dive is for you to think critically about what might go wrong when it comes time to make conclusions about your data.
Your RMarkdown notebook for this data dive should contain the following:
A collection of 5 random samples of your data (with replacement). We are simulating the act of collecting data (like yours) from a population where the simulated “population” here is represented by the data set you already have.
Each subsample should be roughly 50% as long as your data.
Store each sample set in a separate data frame (e.g., df_2 might be the second of these samples).
Of course, these subsamples should each include both categorical and continuous (numeric) data.
Scrutinize these subsamples. Note: you might find group_by quite helpful here!
How different are they?
What would you have called an anomaly in one sub-sample that you wouldn’t in another?
Are there aspects of the data that are consistent among all sub-samples?
Consider how this investigation affects how you might draw conclusions about the data in the future.
For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions that might need investigation.
library(tidyverse) # load the tidyverse library
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(scales) #load the scales package for use with currency formatting
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
library(readr) # readr is already attached with the tidyverse; loaded explicitly here for read_csv() and parse_number()
t_box_office <- read_csv("C:/Users/danjh/Grad School/H510 Stats for DS/Datasets/box_office_data_2000_24_adj.csv")
## Rows: 5000 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Release Group, Genres, Rating, Original_Language, Production_Count...
## dbl (10): Rank, $Worldwide, $Domestic, Domestic %, $Foreign, Foreign %, Year...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(t_box_office)[2] <- "Movie_Name" #change the ambiguous name "Release Group" to the more explicit "Movie_Name"
head(t_box_office)
## # A tibble: 6 × 17
## Rank Movie_Name `$Worldwide` `$Domestic` `Domestic %` `$Foreign` `Foreign %`
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 Mission: I… 546388108 215409889 39.4 330978219 60.6
## 2 2 Gladiator 460583960 187705427 40.8 272878533 59.2
## 3 3 Cast Away 429632142 233632142 54.4 196000000 45.6
## 4 4 What Women… 374111707 182811707 48.9 191300000 51.1
## 5 5 Dinosaur 349822765 137748063 39.4 212074702 60.6
## 6 6 How the Gr… 345842198 260745620 75.4 85096578 24.6
## # ℹ 10 more variables: Year <dbl>, Genres <chr>, Rating <chr>,
## # Vote_Count <dbl>, Original_Language <chr>, Production_Countries <chr>,
## # Prime_Genre <chr>, Prime_Production_Country <chr>, Rating_scale <dbl>,
## # Rating_of_10 <dbl>
Through my initial data manipulations, I found the default formatting of the currency columns distracting and difficult to follow, so I decided that cleaning this up at the load stage would make the data easier to follow and digest. Additionally, the movie name column is titled “Release Group” in the original data set, and the documentation does not make clear why that name was chosen. For clarity, I renamed the column to Movie_Name for the analysis tasks that follow.
I also had to make some adjustments to the data: when I first tried to compute summary statistics, the columns labeled with special characters (e.g., $Worldwide) were read in as text by default. The parse_number() function from the readr package solves this problem.
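For reference, here is a pipe-based sketch of that load-stage cleanup, equivalent to the colnames() and parse_number() steps described above rather than something to run in addition to them. It assumes the raw currency columns arrive as text like “$546,388,108”; the t_clean name and the across() pattern are illustrative choices, not part of the original code.
t_clean <- t_box_office %>%
  rename(Movie_Name = `Release Group`) %>%               # same fix as the colnames() call above
  mutate(across(c(`$Worldwide`, `$Domestic`, `$Foreign`),
                ~ parse_number(as.character(.x))))       # currency text -> numbers
dollar(t_clean$`$Worldwide`[1])                          # scales::dollar() formats for display, e.g. "$546,388,108"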
Now that I am more familiar with the data, I can start posing some questions that I would like to ask of it.
Is there any correlation between a movie’s score and its gross earnings? How about by genre or production country?
Can we establish a trust factor for the score based on the number of votes behind it?
Would it be clearer to rename the columns referred to as “rating” to “score” or something similar, since “rating” has a different connotation for movies?
Does the primary production country suggest anything about a movie’s earnings success?
I also noticed a significant gap in the data: it does not include movie budgets. This could introduce bias, because success here can only be judged on gross earnings and ratings, whereas success is typically also judged on earnings relative to budget.
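As a sketch of how the correlation question might eventually be probed (not part of this week’s tasks), a first look could be a simple cor() call; use = "complete.obs" drops the NA ratings noted later in the summaries.
cor(t_box_office$Rating_of_10, t_box_office$`$Worldwide`, use = "complete.obs")
# a grouped version by Prime_Genre could follow the same pattern
t_box_office %>%
  group_by(Prime_Genre) %>%
  summarise(r = cor(Rating_of_10, `$Worldwide`, use = "complete.obs"))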
#Create a subset of the data with just the numeric data
t_bo_subset <- t_box_office[, c("$Worldwide", "$Domestic", "$Foreign", "Vote_Count", "Rating_of_10")]
head(t_bo_subset)
## # A tibble: 6 × 5
## `$Worldwide` `$Domestic` `$Foreign` Vote_Count Rating_of_10
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 546388108 215409889 330978219 6741 6.13
## 2 460583960 187705427 272878533 19032 8.22
## 3 429632142 233632142 196000000 11403 7.66
## 4 374111707 182811707 191300000 3944 6.45
## 5 349822765 137748063 212074702 2530 6.54
## 6 345842198 260745620 85096578 7591 6.8
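An equivalent way to write this subset, for readers who prefer dplyr, is select(); the backticks are required because the $-prefixed names are not syntactic. This is an alternative to the base-R indexing above, with the same result:
t_bo_subset <- select(t_box_office,
                      `$Worldwide`, `$Domestic`, `$Foreign`,
                      Vote_Count, Rating_of_10)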
# Create the 5 sample sets, drawing rows with replacement
sample_size <- nrow(t_bo_subset) / 2 # each sample is roughly 50% as long as the data
sample_01 <- t_bo_subset[sample(nrow(t_bo_subset), sample_size, replace = TRUE), ]
sample_02 <- t_bo_subset[sample(nrow(t_bo_subset), sample_size, replace = TRUE), ]
sample_03 <- t_bo_subset[sample(nrow(t_bo_subset), sample_size, replace = TRUE), ]
sample_04 <- t_bo_subset[sample(nrow(t_bo_subset), sample_size, replace = TRUE), ]
sample_05 <- t_bo_subset[sample(nrow(t_bo_subset), sample_size, replace = TRUE), ]
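Because the five assignments differ only in name, the same draws could be made with a short loop. A sketch follows, with a set.seed() call (my addition, not in the original chunk) so the knitted results are reproducible; the seed value 510 is arbitrary.
set.seed(510) # arbitrary seed for reproducible draws
samples <- lapply(1:5, function(i) {
  t_bo_subset[sample(nrow(t_bo_subset), sample_size, replace = TRUE), ]
})
sample_01 <- samples[[1]] # and likewise for samples 02 through 05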
# Spot-check that the samples differ
head(sample_01)
## # A tibble: 6 × 5
## `$Worldwide` `$Domestic` `$Foreign` Vote_Count Rating_of_10
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 33821338 25994 33795344 0 0
## 2 85512300 32000304 53511996 5038 6.2
## 3 50444358 40713082 9731276 1070 7.63
## 4 7361414 5600000 1761414 30 7
## 5 16726510 721439 16005071 117 6.47
## 6 15439299 0 15439299 655 6.44
head(sample_02)
## # A tibble: 6 × 5
## `$Worldwide` `$Domestic` `$Foreign` Vote_Count Rating_of_10
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 49830607 27779426 22051181 1884 6.23
## 2 193967670 99967670 94000000 1948 5.86
## 3 187281115 110359362 76921753 2939 5.91
## 4 20226058 4050103 16175955 723 7.45
## 5 34963395 0 34963395 100 7.16
## 6 20648328 10330853 10317475 2370 6.57
head(sample_03)
## # A tibble: 6 × 5
## `$Worldwide` `$Domestic` `$Foreign` Vote_Count Rating_of_10
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 176070171 106128601 69941570 5517 6.8
## 2 587204668 187168425 400036243 7067 6.92
## 3 100572044 74541707 26030337 779 6.4
## 4 32001580 0 32001580 5 5.2
## 5 49619118 0 49619118 2965 7.74
## 6 25948636 0 25948636 7 5
head(sample_04)
## # A tibble: 6 × 5
## `$Worldwide` `$Domestic` `$Foreign` Vote_Count Rating_of_10
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 29144802 19765868 9378934 341 6.1
## 2 55611001 44875481 10735520 1492 5.56
## 3 250425512 75030163 175395349 3248 5.11
## 4 215905815 58568815 157337000 3836 7.5
## 5 181732879 0 181732879 41 3.9
## 6 25030264 0 25030264 96 6.4
head(sample_05)
## # A tibble: 6 × 5
## `$Worldwide` `$Domestic` `$Foreign` Vote_Count Rating_of_10
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 81244605 24149393 57095212 2620 6.63
## 2 10308627 0 10308627 17 6.3
## 3 487637474 1721446 485916028 78 6.51
## 4 176506819 66486205 110020614 10167 6.89
## 5 125427681 73921000 51506681 1211 6.22
## 6 30932534 0 30932534 183 6.3
summary(sample_01[c("$Worldwide", "$Foreign", "$Domestic")]) # Examining the $ summary statistics for this sample
## $Worldwide $Foreign $Domestic
## Min. :1.699e+06 Min. :0.000e+00 Min. : 0
## 1st Qu.:2.402e+07 1st Qu.:1.230e+07 1st Qu.: 10854
## Median :4.926e+07 Median :3.054e+07 Median : 17576612
## Mean :1.223e+08 Mean :7.718e+07 Mean : 45141398
## 3rd Qu.:1.240e+08 3rd Qu.:7.525e+07 3rd Qu.: 54851604
## Max. :2.744e+09 Max. :1.994e+09 Max. :936662225
summary(sample_02[c("$Worldwide", "$Foreign", "$Domestic")]) # Examining the $ summary statistics for this sample
## $Worldwide $Foreign $Domestic
## Min. :1.666e+06 Min. :0.000e+00 Min. : 0
## 1st Qu.:2.393e+07 1st Qu.:1.350e+07 1st Qu.: 74054
## Median :4.920e+07 Median :3.033e+07 Median : 17613046
## Mean :1.177e+08 Mean :7.385e+07 Mean : 43856805
## 3rd Qu.:1.201e+08 3rd Qu.:7.214e+07 3rd Qu.: 53563168
## Max. :2.799e+09 Max. :1.941e+09 Max. :858373000
summary(sample_03[c("$Worldwide", "$Foreign", "$Domestic")]) # Examining the $ summary statistics for this sample
## $Worldwide $Foreign $Domestic
## Min. :1.666e+06 Min. :0.000e+00 Min. : 0
## 1st Qu.:2.483e+07 1st Qu.:1.353e+07 1st Qu.: 21426
## Median :4.811e+07 Median :2.978e+07 Median : 18244060
## Mean :1.161e+08 Mean :7.198e+07 Mean : 44118316
## 3rd Qu.:1.125e+08 3rd Qu.:6.420e+07 3rd Qu.: 52557126
## Max. :2.744e+09 Max. :1.994e+09 Max. :749766139
summary(sample_04[c("$Worldwide", "$Foreign", "$Domestic")]) # Examining the $ summary statistics for this sample
## $Worldwide $Foreign $Domestic
## Min. :1.699e+06 Min. :0.000e+00 Min. : 0
## 1st Qu.:2.522e+07 1st Qu.:1.372e+07 1st Qu.: 131058
## Median :4.872e+07 Median :3.046e+07 Median : 18336476
## Mean :1.199e+08 Mean :7.518e+07 Mean : 44741012
## 3rd Qu.:1.189e+08 3rd Qu.:7.400e+07 3rd Qu.: 54361307
## Max. :2.799e+09 Max. :1.994e+09 Max. :858373000
summary(sample_05[c("$Worldwide", "$Foreign", "$Domestic")]) # Examining the $ summary statistics for this sample
## $Worldwide $Foreign $Domestic
## Min. :1.666e+06 Min. :0.000e+00 Min. : 0
## 1st Qu.:2.410e+07 1st Qu.:1.329e+07 1st Qu.: 203724
## Median :4.859e+07 Median :2.910e+07 Median : 18346039
## Mean :1.247e+08 Mean :7.831e+07 Mean : 46423473
## 3rd Qu.:1.159e+08 3rd Qu.:7.036e+07 3rd Qu.: 54731865
## Max. :2.744e+09 Max. :1.994e+09 Max. :936662225
summary(sample_01[c("Vote_Count", "Rating_of_10")]) #Examining vote_counts and Ratings for this sample
## Vote_Count Rating_of_10
## Min. : 0 Min. :0.000
## 1st Qu.: 192 1st Qu.:6.000
## Median : 1010 Median :6.524
## Mean : 2545 Mean :6.486
## 3rd Qu.: 3126 3rd Qu.:7.100
## Max. :36753 Max. :9.700
## NA's :92 NA's :92
summary(sample_02[c("Vote_Count", "Rating_of_10")]) #Examining vote_counts and Ratings for this sample
## Vote_Count Rating_of_10
## Min. : 0 Min. :0.000
## 1st Qu.: 215 1st Qu.:6.000
## Median : 1068 Median :6.582
## Mean : 2433 Mean :6.463
## 3rd Qu.: 3097 3rd Qu.:7.100
## Max. :35934 Max. :9.000
## NA's :88 NA's :88
summary(sample_03[c("Vote_Count", "Rating_of_10")]) #Examining vote_counts and Ratings for this sample
## Vote_Count Rating_of_10
## Min. : 0 Min. :0.000
## 1st Qu.: 203 1st Qu.:6.000
## Median : 1036 Median :6.600
## Mean : 2531 Mean :6.487
## 3rd Qu.: 3046 3rd Qu.:7.136
## Max. :33108 Max. :9.000
## NA's :87 NA's :87
summary(sample_04[c("Vote_Count", "Rating_of_10")]) #Examining vote_counts and Ratings for this sample
## Vote_Count Rating_of_10
## Min. : 0.0 Min. :0.000
## 1st Qu.: 212.8 1st Qu.:6.024
## Median : 1023.5 Median :6.599
## Mean : 2550.8 Mean :6.517
## 3rd Qu.: 2887.0 3rd Qu.:7.152
## Max. :35934.0 Max. :9.000
## NA's :84 NA's :84
summary(sample_05[c("Vote_Count", "Rating_of_10")]) #Examining vote_counts and Ratings for this sample
## Vote_Count Rating_of_10
## Min. : 0.0 Min. :0.000
## 1st Qu.: 222.5 1st Qu.:6.000
## Median : 1055.0 Median :6.587
## Mean : 2695.7 Mean :6.477
## 3rd Qu.: 3111.0 3rd Qu.:7.100
## Max. :35934.0 Max. :8.600
## NA's :92 NA's :92
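The group_by hint from the assignment suggests a more compact way to scrutinize the subsamples than the ten summary() calls above: stack the samples with an identifying column and summarise them side by side. A sketch with a few illustrative statistics (the s1–s5 labels are my own):
bind_rows(list(s1 = sample_01, s2 = sample_02, s3 = sample_03,
               s4 = sample_04, s5 = sample_05), .id = "sample") %>%
  group_by(sample) %>%
  summarise(median_worldwide = median(`$Worldwide`),
            mean_rating = mean(Rating_of_10, na.rm = TRUE),
            na_ratings = sum(is.na(Rating_of_10)))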
All five sample sets differ, but they fall within the same general ranges.
Looking at the $Worldwide minimums and maximums as an example:
The minimum ranges from 1.666e+06 to 1.699e+06, a difference of about $33K.
The maximum ranges from 2.744e+09 to 2.799e+09, a difference of about $55M.
The maximum $Domestic gross shows the most variation: it is lowest in sample_03 at $749,766,139 and highest in samples 01 and 05 at $936,662,225.
The minimum rating was consistently 0, as was the minimum number of votes. This suggests the data set is large enough to contain a meaningful number of movies with zero votes and zero ratings. The NA counts bear this out, ranging from 84 to 92 across the samples.
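One way to back up that reading would be to count the zeros and NAs directly, for example in sample_01 (a quick sketch; the column choices mirror the summaries above):
sample_01 %>%
  summarise(zero_ratings = sum(Rating_of_10 == 0, na.rm = TRUE), # ratings of exactly 0
            zero_votes   = sum(Vote_Count == 0, na.rm = TRUE),   # movies with no votes
            na_ratings   = sum(is.na(Rating_of_10)))             # missing ratings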