rm(list = ls())
#load the data libraries - remove or add as needed
library(tidyverse) #tools for data science; includes ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats as core libraries.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(scales) #loaded to address visualization issues, including currency formatting
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
options(scipen=999) #disable scientific notation, since the data contains very large dollar values
#load the adjusted version of the csv from the local desktop
t_box_office <- read_delim("C:/Users/danjh/Grad School/H510 Stats for DS/Datasets/box_office_data_2000_24_adj.csv", delim = ",")
## Rows: 5000 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Release Group, Genres, Rating, Original_Language, Production_Count...
## dbl (10): Rank, $Worldwide, $Domestic, Domestic %, $Foreign, Foreign %, Year...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
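As the message above suggests, the printout can be silenced on future reads; a minimal sketch using the same path:

# Optional: suppress the column-specification message when re-reading the file
t_box_office <- read_delim(
  "C:/Users/danjh/Grad School/H510 Stats for DS/Datasets/box_office_data_2000_24_adj.csv",
  delim = ",", show_col_types = FALSE
)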
#pull colnames from data set
colnames(t_box_office)
## [1] "Rank" "Release Group"
## [3] "$Worldwide" "$Domestic"
## [5] "Domestic %" "$Foreign"
## [7] "Foreign %" "Year"
## [9] "Genres" "Rating"
## [11] "Vote_Count" "Original_Language"
## [13] "Production_Countries" "Prime_Genre"
## [15] "Prime_Production_Country" "Rating_scale"
## [17] "Rating_of_10"
# Print column names one per row
movies <- t_box_office
cat(colnames(movies), sep = "\n") #learned how to properly use the cat() function
## Rank
## Release Group
## $Worldwide
## $Domestic
## Domestic %
## $Foreign
## Foreign %
## Year
## Genres
## Rating
## Vote_Count
## Original_Language
## Production_Countries
## Prime_Genre
## Prime_Production_Country
## Rating_scale
## Rating_of_10
The purpose of this week’s data dive is to think critically not only about the importance of documenting your model, but also about the importance of referencing the documentation for the data you’re using.
- Choose two numeric variables, and pair each one with a column you built (i.e., calculated based on others).
- So, you should have two pairs of columns (one original column, and one created/“mutated” column).
- All variables for this data dive should be either continuous (i.e., numeric) or ordered (e.g., ['small', 'medium', 'large'] is okay, but ["apples", "oranges", "bananas"] is not).
- At least one pair should be a response variable and an explanatory variable.
Let’s decide what variables to use.
My original columns are $Domestic and Vote_Count.
To pair with $Domestic, I am going to create a variable called foreign_to_domestic_ratio.
$Domestic is the dollars a movie made domestically.
foreign_to_domestic_ratio represents how a movie did in the foreign market compared to the domestic market: a value larger than 1 shows the movie did better internationally than it did domestically, while a value less than 1 shows the movie did better domestically.
Here’s the code to add the new column
#add the new column foreign_to_domestic_ratio
#note: movies with $0 domestic earnings produce Inf here, which surfaces again later
movies$foreign_to_domestic_ratio <- movies$`$Foreign` / movies$`$Domestic`
head(movies) #check to make sure the column got added
## # A tibble: 6 × 18
## Rank `Release Group` `$Worldwide` `$Domestic` `Domestic %` `$Foreign`
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 Mission: Impossible II 546388108 215409889 39.4 330978219
## 2 2 Gladiator 460583960 187705427 40.8 272878533
## 3 3 Cast Away 429632142 233632142 54.4 196000000
## 4 4 What Women Want 374111707 182811707 48.9 191300000
## 5 5 Dinosaur 349822765 137748063 39.4 212074702
## 6 6 How the Grinch Stole C… 345842198 260745620 75.4 85096578
## # ℹ 12 more variables: `Foreign %` <dbl>, Year <dbl>, Genres <chr>,
## # Rating <chr>, Vote_Count <dbl>, Original_Language <chr>,
## # Production_Countries <chr>, Prime_Genre <chr>,
## # Prime_Production_Country <chr>, Rating_scale <dbl>, Rating_of_10 <dbl>,
## # foreign_to_domestic_ratio <dbl>
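Since dividing by a $Domestic of zero produces Inf, a quick sanity check on the new column is worth doing before plotting (a sketch; these counts matter for the plots and correlations below):

# Sanity check: movies with $0 domestic earnings yield Inf ratios, and these
# will surface later as log-scale warnings and NaN correlation results
sum(is.infinite(movies$foreign_to_domestic_ratio))
sum(is.na(movies$foreign_to_domestic_ratio))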
Next is Vote_Count.
To pair with Vote_Count, I am creating a variable called popularity_index.
Vote_Count is the number of votes a movie received.
popularity_index multiplies the number of votes by the average score, suggesting the popularity of a movie based on its score and how often it was voted for. This might be interesting when grouping movies into categories based on the range of votes (a sketch of that idea follows the output below).
#add the column popularity_index
movies$popularity_index <- movies$Vote_Count * movies$Rating_of_10
head(movies) #check to make sure the column got added
## # A tibble: 6 × 19
## Rank `Release Group` `$Worldwide` `$Domestic` `Domestic %` `$Foreign`
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 Mission: Impossible II 546388108 215409889 39.4 330978219
## 2 2 Gladiator 460583960 187705427 40.8 272878533
## 3 3 Cast Away 429632142 233632142 54.4 196000000
## 4 4 What Women Want 374111707 182811707 48.9 191300000
## 5 5 Dinosaur 349822765 137748063 39.4 212074702
## 6 6 How the Grinch Stole C… 345842198 260745620 75.4 85096578
## # ℹ 13 more variables: `Foreign %` <dbl>, Year <dbl>, Genres <chr>,
## # Rating <chr>, Vote_Count <dbl>, Original_Language <chr>,
## # Production_Countries <chr>, Prime_Genre <chr>,
## # Prime_Production_Country <chr>, Rating_scale <dbl>, Rating_of_10 <dbl>,
## # foreign_to_domestic_ratio <dbl>, popularity_index <dbl>
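The grouping idea mentioned above could look something like this (a sketch only; the vote-count breakpoints are illustrative assumptions, not taken from the data documentation):

# Sketch: bin movies by vote-count range and compare average popularity_index
movies %>%
  mutate(vote_band = cut(Vote_Count,
                         breaks = c(0, 100, 1000, 10000, Inf),
                         labels = c("<100", "100-1k", "1k-10k", "10k+"))) %>%
  group_by(vote_band) %>%
  summarise(mean_popularity = mean(popularity_index, na.rm = TRUE),
            movies_in_band = n())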
Plot a visualization for each relationship, and draw some conclusions based on the plot
- Use what we’ve covered so far in class to scrutinize the plot (e.g., are there any outliers?)
# Scatter plot for $Domestic vs. foreign_to_domestic_ratio
ggplot(movies, aes(x = foreign_to_domestic_ratio, y = `$Domestic`)) +
geom_point() +
labs(title = "$Domestic vs. Foreign to Domestic Ratio",
x = "Foreign to Domestic Ratio",
y = "$Domestic Revenue") +
theme_minimal()
This plot doesn’t seem helpful, as all the data is pressed against the axes, and there appear to be definite outliers.
Maybe that’s because there is such wide variation in the values? Trying the same plot with logarithmic scales.
# Scatter plot with logarithmic scales
ggplot(movies, aes(x = foreign_to_domestic_ratio, y = `$Domestic`)) +
geom_point() +
scale_x_log10() +
scale_y_log10() +
labs(title = "$Domestic vs. Foreign to Domestic Ratio (Log Scale)",
x = "Foreign to Domestic Ratio (Log Scale)",
y = "$Domestic Revenue (Log Scale)") +
theme_minimal()
## Warning in scale_x_log10(): log-10 transformation introduced infinite values.
## Warning in scale_y_log10(): log-10 transformation introduced infinite values.
This graph seems more helpful, but there is still a large cluster of data in the center, and it’s hard to say how much overlap there is. There is a very interesting group of outliers at the far left of the x-axis (ratios near zero), with domestic revenue between roughly $1 million and $100 million.
This is more of an observation than an insight. A challenge I have had with this data set is the wide range of values. When I was researching how to address this, the suggestion to use a log scale came up repeatedly. I like the concept and how it can, in effect, standardize the data onto a comparable scale. I understand why it works, but it will take some more practice before it feels like a natural approach.
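One possible refinement (a sketch, not part of the original assignment): filtering to finite, positive ratios removes the log-transform warnings, and adding transparency makes the dense central cluster easier to read. The dollar labels use the scales package loaded earlier.

# Sketch: drop zero/Inf ratios before log-transforming, and use alpha to
# reveal overlap in the central cluster (alpha value is an arbitrary choice)
movies %>%
  filter(is.finite(foreign_to_domestic_ratio),
         foreign_to_domestic_ratio > 0,
         `$Domestic` > 0) %>%
  ggplot(aes(x = foreign_to_domestic_ratio, y = `$Domestic`)) +
  geom_point(alpha = 0.2) +
  scale_x_log10() +
  scale_y_log10(labels = label_dollar()) +
  labs(title = "$Domestic vs. Foreign to Domestic Ratio (Log Scale, filtered)",
       x = "Foreign to Domestic Ratio (Log Scale)",
       y = "$Domestic Revenue (Log Scale)") +
  theme_minimal()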
# Scatter plot for Vote_Count vs. popularity_index
ggplot(movies, aes(x = popularity_index, y = Vote_Count)) +
geom_point() +
labs(title = "Vote Count vs. Popularity Index",
x = "Popularity Index",
y = "Vote Count") +
theme_minimal()
## Warning: Removed 170 rows containing missing values or values outside the scale range
## (`geom_point()`).
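The removed rows are presumably movies where popularity_index is NA (it inherits missingness from Rating_of_10 or Vote_Count); a quick check:

# How many movies lack a popularity_index (likely the rows the warning removed)
sum(is.na(movies$popularity_index))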
Calculate the appropriate correlation coefficient for each of these combinations
- Explain why the value makes sense (or doesn’t) based on the visualization(s)
First, figuring the correlation coefficient between $Domestic and foreign_to_domestic_ratio.
#first try
domestic <- as.numeric(movies$`$Domestic`)
foreign_ratio <- as.numeric(movies$foreign_to_domestic_ratio)
cor(domestic, foreign_ratio, use="complete.obs")
## [1] NaN
This calculation gave me difficulty. It wasn’t until I looked at the data for the ratio that I realized the division by zero for movies with no domestic earnings was producing infinite values, which turn the correlation into NaN (not-a-number) results, and I had to find a way to account for them.
# Convert to numeric
domestic <- as.numeric(movies$`$Domestic`)
foreign_ratio <- as.numeric(movies$foreign_to_domestic_ratio)
# Remove NA and infinite values
valid_indices <- !is.na(domestic) & !is.na(foreign_ratio) & is.finite(foreign_ratio)
# Calculate correlation on filtered data
cor(domestic[valid_indices], foreign_ratio[valid_indices], use="complete.obs")
## [1] -0.06256125
The correlation is near zero. This suggests there is almost no linear relationship between domestic earnings and the foreign-to-domestic ratio; when the ratio changes, it doesn’t seem to have any impact on domestic earnings. If there is any correlation, it trends slightly downward, but it is so small as to be almost meaningless. This is interesting, as the logarithmic version of the graph above suggests more of a negative trend.
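One way to probe that discrepancy (an exploratory check, not part of the original assignment): Pearson’s coefficient measures linear association and is dominated by the extreme values here, while Spearman’s rank correlation is unchanged by monotone transformations like the log, so it lines up more closely with what the log-scale plot shows.

# Spearman uses ranks, so it is insensitive to the skew that the log scale hides
cor(domestic[valid_indices], foreign_ratio[valid_indices], method = "spearman")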
Now, to do the same with Vote_Count and the popularity index.
# Calculate correlation using complete observations
votes <- as.numeric(movies$Vote_Count)
popularity <- as.numeric(movies$popularity_index)
cor(votes, popularity, use="complete.obs")
## [1] 0.9944925
The correlation is very strongly positive, and the graph clearly bears this out. This one makes perfect sense: popularity_index is defined as Vote_Count times Rating_of_10, so it is nearly a rescaled copy of Vote_Count, and a correlation close to 1 is expected almost by construction.
Build a confidence interval for each of the response variable(s). Provide a detailed conclusion of the response variable (i.e., the population) based on your confidence interval.
# Simplify the variable to avoid typing issues
dom_dollars <- movies$`$Domestic`
# Calculate sample mean
mean_domestic <- mean(dom_dollars)
#print(paste("Sample Mean:", mean_domestic))
# Calculate sample standard deviation
sd_domestic <- sd(dom_dollars)
#print(paste("Sample Standard Deviation:", sd_domestic))
# Calculate sample size
n_domestic <- length(dom_dollars)
#print(paste("Sample Size:", n_domestic))
# Calculate standard error
se_domestic <- sd_domestic / sqrt(n_domestic)
#print(paste("Standard Error:", se_domestic))
# Determine the critical value for 95% confidence interval
alpha <- 0.05
t_critical_domestic <- qt(1 - alpha/2, df = n_domestic - 1)
#print(paste("Critical Value:", t_critical_domestic))
# Calculate the margin of error
margin_of_error_domestic <- t_critical_domestic * se_domestic
#print(paste("Margin of Error:", margin_of_error_domestic))
# Construct the confidence interval
lower_bound_domestic <- mean_domestic - margin_of_error_domestic
upper_bound_domestic <- mean_domestic + margin_of_error_domestic
# Print the confidence interval
cat("95% Confidence Interval for Domestic Earnings: [", lower_bound_domestic, ", ", upper_bound_domestic, "]\n")
## 95% Confidence Interval for Domestic Earnings: [ 42582931 , 46867535 ]
Keeping your R environment clean and being careful with variable usage is highly important. My code above uses multiple print statements (now commented out) because I was getting inconsistent results when trying to build this code and then repeat it for the next variable pair. This turned out to be a good troubleshooting technique, as it helped me discover that I had reused a variable. Additionally, it made it easier to see that I had made an error in the script below by forgetting the na.rm = TRUE argument: I kept getting NA values and could not locate the issue until I applied this troubleshooting technique.
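A sketch of one way to avoid that variable-reuse trap: wrap the whole interval calculation in a small helper function (ci_mean is a name I made up, not from the assignment), so each call gets its own clean local variables and NAs are dropped before the sample size is counted.

# Hypothetical helper: t-based confidence interval for a mean, with NAs
# removed up front so n, mean, and sd all describe the same observations
ci_mean <- function(x, conf = 0.95) {
  x <- x[!is.na(x)]
  n <- length(x)
  se <- sd(x) / sqrt(n)                        # standard error of the mean
  t_crit <- qt(1 - (1 - conf) / 2, df = n - 1)
  mean(x) + c(-1, 1) * t_crit * se             # c(lower, upper)
}
ci_mean(movies$`$Domestic`)  # should match the interval above (assuming no NAs in $Domestic)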
The 95% confidence interval is between $42,582,931 and $46,867,535, so we are 95% confident that the true mean of domestic earnings falls within this range, somewhere around $44 million. This is a range of around $4 million, which in context is not a very large range. This suggests that the movies in this data set perform similarly in terms of domestic earnings. Given this data, it is likely that future movies that rate in the top 200 will show similar earnings.
vote_cnt <- movies$Vote_Count
# Calculate sample mean
mean_votes <- mean(vote_cnt, na.rm = TRUE)
#print(paste("Sample Mean:", mean_votes))
# Calculate sample standard deviation
sd_votes <- sd(vote_cnt, na.rm = TRUE)
# Calculate sample size
n_votes <- length(vote_cnt) #note: length() counts NA entries too; sum(!is.na(vote_cnt)) would match the na.rm'd mean and sd
# Calculate standard error
se_votes <- sd_votes / sqrt(n_votes)
# Determine the critical value for 95% confidence interval
alpha <- 0.05
t_critical_votes <- qt(1 - alpha/2, df = n_votes - 1)
# Calculate the margin of error
margin_of_error_votes <- t_critical_votes * se_votes
# Construct the confidence interval
lower_bound_votes <- mean_votes - margin_of_error_votes
upper_bound_votes <- mean_votes + margin_of_error_votes
# Print the confidence interval
cat("95% Confidence Interval for Vote_Count: [", lower_bound_votes, ", ", upper_bound_votes, "]\n")
## 95% Confidence Interval for Vote_Count: [ 2422.176 , 2640.996 ]
The 95% confidence interval is between 2422 and 2640, so we are 95% confident that the true mean number of votes for a movie that scores in the top 200 falls within this range, somewhere around 2500 votes. The interval of 218 votes is narrow. This suggests that the movies in this data set generally get similar numbers of votes. Given this data, it is likely that future movies that rate in the top 200 will receive similar numbers of votes.
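One caveat on the interval above: length(vote_cnt) counts NA entries while the mean and standard deviation exclude them, so the sample size is slightly inflated and the interval is marginally too narrow. Rerunning with the ci_mean helper sketched earlier, which drops NAs before counting, would give a slightly more honest interval:

# Recompute with the hypothetical ci_mean() helper; n here counts only the
# non-missing votes, so the degrees of freedom match the na.rm'd statistics
ci_mean(movies$Vote_Count)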