setwd("C:/Users/amahm/Downloads/ITEC4220")

Importing the data into RStudio and storing it in a variable named data.

data <- read.csv("movies dataset.csv")

Creating a new variable named filtered_data

It keeps only the movies whose runtime is greater than 0 and at most 240 minutes.

filtered_data <- data[data$runtime <= 240 & data$runtime > 0, ]
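One thing to watch for with this kind of filter: if runtime contains missing values, a plain logical index returns an all-NA row for each of them. The sketch below uses a made-up data frame (toy and its values are assumptions, not the movies dataset) to show the difference when the condition is wrapped in which().

```r
# Toy data standing in for the movies dataset (values are made up).
toy <- data.frame(title = c("A", "B", "C", "D"),
                  runtime = c(95, 0, NA, 300))

# Plain logical indexing: the NA in runtime produces a row of NAs.
with_na <- toy[toy$runtime <= 240 & toy$runtime > 0, ]

# which() keeps only the positions where the condition is TRUE,
# so NA conditions are dropped together with FALSE ones.
without_na <- toy[which(toy$runtime <= 240 & toy$runtime > 0), ]

nrow(with_na)     # 2 (one real row plus one all-NA row)
nrow(without_na)  # 1
```

Whether this matters for the real data depends on whether its runtime column has any missing values, which sum(is.na(data$runtime)) would show.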

1: Histogram

hist(filtered_data$runtime,
     main = "Distribution of Runtime",
     xlab = "Runtime (minutes)",
     col = "lightblue",
     border = "black",
     breaks = 5)

Distribution of runtime

What it does:

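The hist command takes the runtime column of filtered_data (loaded from movies dataset.csv) and plots the distribution of movie runtimes as a histogram. The x-axis shows runtime in minutes, while the y-axis shows the number of movies in each bin. The breaks = 5 argument asks for about 5 bins; when breaks is a single number, R treats it as a suggestion and may pick a different number of bins so that the cut points fall on round values.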
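To see how R handles the breaks argument, the histogram object can be inspected without drawing it. This is a minimal sketch on simulated runtimes (sim_runtime is an assumption for illustration, not the real column).

```r
set.seed(1)
# Simulated runtimes in minutes (illustration only).
sim_runtime <- runif(500, min = 1, max = 240)

# plot = FALSE returns the histogram object without drawing it.
h <- hist(sim_runtime, breaks = 5, plot = FALSE)

h$breaks               # cut points R actually chose
length(h$breaks) - 1   # number of bins actually used
sum(h$counts)          # every value lands in one bin

# Passing a vector of cut points gives exact control over the bins.
exact <- hist(sim_runtime, breaks = seq(0, 240, by = 48), plot = FALSE)
length(exact$counts)   # 5 bins, each 48 minutes wide
```

The same inspection works on the real data by replacing sim_runtime with filtered_data$runtime.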

2: Mean runtime

mean(filtered_data$runtime)
[1] 62.34583

This means that the average runtime of the movies in the filtered dataset is about 62 minutes, or roughly one hour and two minutes.
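As a quick check on the conversion from minutes to hours and minutes, the reported mean can be split with integer division and the remainder. The variable names here are only for illustration.

```r
mean_runtime <- 62.34583                 # mean reported by R above

hours   <- mean_runtime %/% 60           # whole hours
minutes <- round(mean_runtime %% 60)     # remaining minutes, rounded

sprintf("%d h %d min", as.integer(hours), as.integer(minutes))
# "1 h 2 min"
```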

3: Correlation test

This test uses the cor.test function to check whether there is a correlation between the runtime and the revenue of a movie.

cor.test(filtered_data$runtime, filtered_data$revenue, method = "pearson")   # if roughly normal

After performing the correlation test, I got a correlation coefficient of about 0.055. Since this value is close to zero, it indicates that revenue and runtime are not completely independent, but the linear relationship between them is very weak.

Before running the correlation test, I expected a relationship between runtime and revenue, reasoning that longer movies usually make more money. The correlation test did not support that hypothesis.
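The output of cor.test contains several different numbers, so it helps to pull them out by name. The sketch below runs the test on simulated data (sim_runtime and sim_revenue are made-up stand-ins, not the movies dataset) and shows where the coefficient, the confidence interval, and the p-value live.

```r
set.seed(42)
# Simulated stand-ins for runtime and revenue (not the real data).
sim_runtime <- rnorm(200, mean = 100, sd = 20)
sim_revenue <- 1e6 * (50 + 0.05 * sim_runtime + rnorm(200, sd = 10))

res <- cor.test(sim_runtime, sim_revenue, method = "pearson")

res$estimate   # correlation coefficient r, always between -1 and 1
res$conf.int   # 95% confidence interval for r (a lower and an upper bound)
res$p.value    # p-value for the null hypothesis that r is 0
```

The coefficient is a single number, while the confidence interval is always a pair of numbers around it.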

4: Histogram of budget

hist((filtered_data$budget[filtered_data$budget <= 450000000]) / 1e6,
     main = "Distribution of Budget",
     xlab = "Budget (millions)",
     col = "lightblue",
     border = "black",
     breaks = 3)   # suggested number of bins

Distribution of Budget

I think the results are bunched into the first 3 bins because some of the movies in the dataset have a budget of 0 for some reason. Still, this does not look too inaccurate, because most movies' budgets are between 100 and 200 million dollars, with some even lower.
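One way to check the idea about zero budgets is to count them and then look at the distribution without them. This is a sketch on made-up budgets (toy_budget is an assumption, not the real column).

```r
# Toy budgets in dollars (made up); 0 stands for "budget not recorded".
toy_budget <- c(0, 0, 0, 2e6, 15e6, 40e6, 120e6, 200e6)

sum(toy_budget == 0)    # number of zero entries: 3
mean(toy_budget == 0)   # share of zero entries: 0.375

# Keep only positive budgets, expressed in millions of dollars.
positive_budget <- toy_budget[toy_budget > 0] / 1e6

# plot = FALSE only computes the bins; remove it to draw the histogram.
h <- hist(positive_budget, breaks = 10, plot = FALSE)
sum(h$counts)           # 5 movies remain after dropping the zeros
```

On the real data, the same counts would use filtered_data$budget in place of toy_budget.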

5: ANOVA

Create budget groups

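# Split budget (in dollars) into three groups: (0, 50M], (50M, 150M], (150M, Inf).
# A budget of exactly 0 falls outside the first interval and becomes NA.
filtered_data$budget_group <- cut(
  filtered_data$budget,
  breaks = c(0, 50000000, 150000000, Inf),
  labels = c("Low", "Medium", "High")
)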
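To see which group each budget lands in, including the edge cases, cut can be tried on a few hand-picked values. The values in toy_budget are made up for illustration.

```r
# Hand-picked budgets in dollars, including 0, both boundaries, and NA.
toy_budget <- c(0, 10e6, 50e6, 100e6, 150e6, 300e6, NA)

# Same breaks and labels as above. By default the intervals are
# open on the left and closed on the right, so 0 gets no group.
groups <- cut(toy_budget,
              breaks = c(0, 50000000, 150000000, Inf),
              labels = c("Low", "Medium", "High"))

table(groups, useNA = "ifany")   # Low 2, Medium 2, High 1, NA 2

# include.lowest = TRUE puts a budget of exactly 0 into "Low".
groups_with_zero <- cut(toy_budget,
                        breaks = c(0, 50000000, 150000000, Inf),
                        labels = c("Low", "Medium", "High"),
                        include.lowest = TRUE)
```

Rows whose group is NA are dropped by aov, so whether zero-budget movies should count as "Low" or be left out is a choice worth making on purpose.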

Run ANOVA

anova_model <- aov(revenue ~ budget_group, data = filtered_data)
summary(anova_model)

Result:

> summary(anova_model)
                Df    Sum Sq   Mean Sq F value Pr(>F)    
budget_group     2 1.400e+20 7.002e+19   18717 <2e-16 ***
Residuals    56526 2.114e+20 3.741e+15                   

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
835244 observations deleted due to missingness

Summary:

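Since the p-value is below 2e-16, far below the usual 0.05 cutoff, the mean revenue differs significantly between at least two of the budget groups. Budget is strongly associated with revenue, at least statistically in my dataset, although the ANOVA alone does not show that a larger budget causes higher revenue. Note that 835,244 rows were dropped for missing values, so the result is based on the remaining 56,529 movies.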
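The numbers in the ANOVA table are related to each other, and the arithmetic can be reproduced by hand. The sketch below copies the values from the table; the share of variance explained (often called eta squared) is not part of the original output and is added here only as an illustration.

```r
# Values copied from the ANOVA table above.
ss_group <- 1.400e20   # Sum Sq for budget_group
ss_resid <- 2.114e20   # Sum Sq for Residuals
ms_group <- 7.002e19   # Mean Sq for budget_group
ms_resid <- 3.741e15   # Mean Sq for Residuals

# The F value is the ratio of the two mean squares.
f_value <- ms_group / ms_resid

# Share of the total sum of squares attributed to budget_group.
eta_sq <- ss_group / (ss_group + ss_resid)

round(f_value)     # 18717, matching the table
round(eta_sq, 3)   # 0.398
```

Read this way, the budget groups account for roughly 40% of the variation in revenue among the movies that were kept.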

Tukey Honest Significant Difference post-hoc test

tukey_results <- TukeyHSD(anova_model)

tukey_results   # print the pairwise comparisons between budget groups
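The Tukey output has one row per pair of groups. Since the real results are not shown here, the sketch below runs the same two steps on simulated data (toy and its revenue values are made up) to show how the result is laid out.

```r
set.seed(7)
# Simulated revenue for three budget groups (illustration only).
toy <- data.frame(
  budget_group = factor(rep(c("Low", "Medium", "High"), each = 30),
                        levels = c("Low", "Medium", "High")),
  revenue = c(rnorm(30, mean = 10, sd = 5),
              rnorm(30, mean = 40, sd = 5),
              rnorm(30, mean = 90, sd = 5))
)

toy_model <- aov(revenue ~ budget_group, data = toy)
toy_tukey <- TukeyHSD(toy_model)

# Matrix with one row per pair of groups and four columns:
#   diff  = difference between the two group means
#   lwr   = lower bound of the confidence interval for diff
#   upr   = upper bound of the confidence interval for diff
#   p adj = p-value adjusted for multiple comparisons
toy_tukey$budget_group
```

A pair of groups differs significantly when its adjusted p-value is small and its interval from lwr to upr does not include 0.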