setwd("C:/Users/amahm/Downloads/ITEC4220")
1: Importing the data into RStudio and calling it data.
data <- read.csv("movies dataset.csv")
Creating a new variable named filtered_data.
It keeps only movies whose runtime is above 0 and at most 240 minutes.
filtered_data <- data[data$runtime <= 240 & data$runtime > 0, ]
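One thing to watch with a filter like this: if the runtime column contains NA values, plain logical indexing keeps those rows as all-NA rows. The sketch below uses a small made-up data frame (toy and its values are assumptions for illustration, not taken from the movies dataset) to show the difference when the condition is wrapped in which().

```r
# Toy data, invented for illustration only
toy <- data.frame(title = c("A", "B", "C", "D"),
                  runtime = c(90, 0, NA, 300))

# Plain logical indexing: the NA runtime comes back as an all-NA row
plain <- toy[toy$runtime <= 240 & toy$runtime > 0, ]

# which() keeps only the rows where the condition is TRUE
clean <- toy[which(toy$runtime <= 240 & toy$runtime > 0), ]

nrow(plain)  # 2 (movie "A" plus one all-NA row)
nrow(clean)  # 1 (movie "A" only)
```

Whether this matters for the real data depends on whether runtime has missing values, which can be checked with sum(is.na(data$runtime)).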
1: Histogram
hist(filtered_data$runtime,
main = "Distribution of Runtime",
xlab = "Runtime (minutes)",
col = "lightblue",
border = "black",
breaks = 5)
Distribution of runtime
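What it does:
The hist command takes the runtime column from filtered_data (the
filtered rows of movies dataset.csv) and plots the distribution of movie
runtimes as a histogram. The x-axis shows runtime in minutes and the
y-axis shows the number of movies in each bin. The breaks = 5 argument
asks for about 5 bins. R treats a single number as a suggestion, so the
plot may end up with a slightly different number of bins.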
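To see which break points R actually picks for a given breaks value, the histogram can be computed without drawing it. This is a minimal sketch on simulated runtimes (the runtime vector here is made up, not the movies data).

```r
set.seed(1)
runtime <- runif(200, min = 1, max = 240)  # made-up runtimes in minutes

# plot = FALSE returns the histogram object instead of drawing it
h <- hist(runtime, breaks = 5, plot = FALSE)

h$breaks          # the break points R chose
length(h$counts)  # the number of bins that would be drawn
sum(h$counts)     # every observation falls into exactly one bin
```

To force exact bin edges instead, a vector can be passed, for example breaks = seq(0, 240, by = 48).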
2: Mean runtime
mean(filtered_data$runtime)
[1] 62.34583
This means that the average runtime of the movies in the filtered data
is about 62 minutes, or roughly one hour and two minutes.
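The conversion from minutes to hours and minutes can be checked directly in R. The variable names below are only for illustration.

```r
mean_runtime <- 62.34583               # value printed by mean() above

hours   <- mean_runtime %/% 60         # whole hours
minutes <- round(mean_runtime %% 60)   # leftover minutes, rounded

sprintf("%d h %d min", as.integer(hours), as.integer(minutes))
# "1 h 2 min"
```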
3: Correlation test
This test uses the cor.test function to see if there is a
correlation between the runtime and the revenue of a movie.
cor.test(filtered_data$runtime, filtered_data$revenue, method = "pearson") # if roughly normal
After performing the correlation test, I got a value of 0.055. Since
this number is close to zero, it suggests that revenue and runtime are
not completely independent, but any linear relationship between them is
very weak.
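The output of cor.test contains several different numbers, so it helps to pull them out by name. The sketch below uses simulated data (x and y are made up, not taken from the movies dataset) to show where the correlation estimate, the confidence interval, and the p-value live in the result.

```r
set.seed(42)
x <- rnorm(500)
y <- 0.05 * x + rnorm(500)   # weakly related by construction

res <- cor.test(x, y, method = "pearson")

res$estimate   # the correlation coefficient r, a single number
res$conf.int   # the 95% confidence interval, a lower and an upper bound
res$p.value    # the p-value for the test that the true correlation is 0
```

For the movies data, the same fields could be read from the result of cor.test(filtered_data$runtime, filtered_data$revenue) after storing it in a variable.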
Before running the correlation test, I thought that there would be a
relationship between runtime and revenue, my reasoning being that
longer movies usually make more money. After running the correlation
test, the data did not support my hypothesis.
4: Histogram of budget
hist((filtered_data$budget[filtered_data$budget <= 450000000]) / 1e6,
main = "Distribution of Budget",
xlab = "Budget (millions)",
col = "lightblue",
border = "black",
breaks = 3) # you can adjust this
Distribution of Budget
I think that all the results are bunched into the first 3 bins
because some of the movies in the dataset have 0 for the budget for some
reason. But this does not look too inaccurate, because most movie budgets
are between 100 and 200 million dollars, with some being even lower.
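One way to check how much the zero budgets affect the picture is to count them and look at the histogram without them. This is a minimal sketch on a made-up data frame (toy and its values are assumptions, not the movies data).

```r
# Toy data, invented for illustration only
toy <- data.frame(budget = c(0, 0, 0, 5e6, 4e7, 1.2e8, 2e8))

n_zero     <- sum(toy$budget == 0, na.rm = TRUE)   # how many budgets are 0
share_zero <- mean(toy$budget == 0, na.rm = TRUE)  # proportion of zeros

# Budgets in millions, with the zeros left out
nonzero_millions <- toy$budget[toy$budget > 0] / 1e6
h <- hist(nonzero_millions, breaks = 3, plot = FALSE)

n_zero       # 3
share_zero   # 3 out of 7 rows
h$counts     # bin counts without the zero budgets
```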
5:
Create budget groups
filtered_data$budget_group <- cut(
filtered_data$budget,
breaks = c(0, 50000000, 150000000, Inf), # <50M, 50M-150M, >150M
labels = c("Low", "Medium", "High")
)
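By default cut() builds intervals that are open on the left and closed on the right, so with these breaks a budget of exactly 0 falls outside the first interval and becomes NA. The sketch below uses a made-up budget vector to show this, along with the include.lowest = TRUE option as one possible alternative (whether zero budgets belong in the Low group is a judgment call, not something the original analysis states).

```r
budget <- c(0, 1e6, 5e7, 6e7, 2e8)   # made-up budgets

grp <- cut(budget,
           breaks = c(0, 50000000, 150000000, Inf),
           labels = c("Low", "Medium", "High"))
grp                          # NA Low Low Medium High
table(grp, useNA = "ifany")  # counts per group, including the NA

# Alternative: also put the lowest break value (0) into the first group
grp_incl <- cut(budget,
                breaks = c(0, 50000000, 150000000, Inf),
                labels = c("Low", "Medium", "High"),
                include.lowest = TRUE)
grp_incl                     # Low Low Low Medium High
```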
Run ANOVA
anova_model <- aov(revenue ~ budget_group, data = filtered_data)
summary(anova_model)
Result:
> summary(anova_model)
                Df    Sum Sq   Mean Sq F value Pr(>F)
budget_group     2 1.400e+20 7.002e+19   18717 <2e-16 ***
Residuals    56526 2.114e+20 3.741e+15
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
835244 observations deleted due to missingness
Summary:
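Since the p-value is below 2e-16, which is effectively zero, the ANOVA
indicates that mean revenue differs significantly between at least two
of the budget groups. Budget really does matter for revenue, at least
statistically in my dataset. Note that 835244 observations were dropped
because of missing values, so this result is based only on the
remaining rows.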
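To back up the ANOVA with numbers that are easier to read, the p-value and the group means can be pulled out directly. This is a minimal sketch on simulated data (toy, its group sizes, and its revenue values are made up for illustration).

```r
set.seed(7)
toy <- data.frame(
  budget_group = factor(rep(c("Low", "Medium", "High"), each = 30),
                        levels = c("Low", "Medium", "High")),
  revenue = c(rnorm(30, mean = 10), rnorm(30, mean = 50), rnorm(30, mean = 200))
)

model <- aov(revenue ~ budget_group, data = toy)

# The first element of the summary is the ANOVA table
p_value <- summary(model)[[1]][["Pr(>F)"]][1]

# Mean revenue per budget group
group_means <- tapply(toy$revenue, toy$budget_group, mean)

p_value
group_means
```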
Tukey Honest Significant Difference post-hoc test
tukey_results <- TukeyHSD(anova_model)
# tukey_results  (uncomment to print the pairwise comparisons)
After running the Tukey test, the results show that movies in the High
budget group earn significantly more revenue.
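The Tukey result stores one row per pair of groups, which makes it possible to see which pairs differ and by how much. This is a minimal sketch on simulated data (toy and its values are made up, not the movies data).

```r
set.seed(7)
toy <- data.frame(
  budget_group = factor(rep(c("Low", "Medium", "High"), each = 30),
                        levels = c("Low", "Medium", "High")),
  revenue = c(rnorm(30, mean = 10), rnorm(30, mean = 50), rnorm(30, mean = 200))
)

model <- aov(revenue ~ budget_group, data = toy)
tk <- TukeyHSD(model)

# One row per pair; columns are diff, lwr, upr and p adj
tk$budget_group

# Difference in mean revenue between the High and Low groups
tk$budget_group["High-Low", "diff"]
```

A positive diff with a small p adj for a pair such as High-Low means the first group in the pair has the significantly higher mean.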