About Dataset

The IMDB dataset contains information about movies, including their names, release dates, user ratings, genres, overviews, cast and crew members, original titles, production status, original languages, budgets, revenues, and countries of origin. This data can be used for various analyses, such as identifying trends in movie genres, exploring the relationship between budget and revenue, and predicting the success of future movies.

# Load the lubridate package
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(plyr)
library(plotly)
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following objects are masked from 'package:plyr':
## 
##     arrange, mutate, rename, summarise
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
library(vcd)
## Loading required package: grid
movieData <-read.csv('C:/Users/govin/OneDrive/Desktop/RStudio/Data/imdb_movies.csv')

movieData$date_x <- sapply(movieData$date_x, function(x) gsub("/", "-", x))
movieData[c('date_x')] <- lapply(movieData[c('date_x')], function(x) as.Date(x, format="%m-%d-%Y"))
movieData <- type_convert(movieData)
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   names = col_character(),
##   genre = col_character(),
##   overview = col_character(),
##   crew = col_character(),
##   orig_title = col_character(),
##   status = col_character(),
##   orig_lang = col_character(),
##   country = col_character()
## )
head(movieData, 10)
##                           names     date_x score
## 1                     Creed III 2023-03-02    73
## 2      Avatar: The Way of Water 2022-12-15    78
## 3   The Super Mario Bros. Movie 2023-04-05    76
## 4                       Mummies 2023-01-05    70
## 5                     Supercell 2023-03-17    61
## 6                  Cocaine Bear 2023-02-23    66
## 7          John Wick: Chapter 4 2023-03-23    80
## 8  Puss in Boots: The Last Wish 2022-12-26    83
## 9               Attack on Titan 2022-09-30    59
## 10                     The Park 2023-03-02    58
##                                               genre
## 1                                     Drama, Action
## 2                Science Fiction, Adventure, Action
## 3     Animation, Adventure, Family, Fantasy, Comedy
## 4     Animation, Comedy, Family, Adventure, Fantasy
## 5                                            Action
## 6                           Thriller, Comedy, Crime
## 7                           Action, Thriller, Crime
## 8     Animation, Family, Fantasy, Adventure, Comedy
## 9                           Action, Science Fiction
## 10 Action, Drama, Horror, Science Fiction, Thriller
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                      overview
## 1  After dominating the boxing world, Adonis Creed has been thriving in both his career and family life. When a childhood friend and former boxing prodigy, Damien Anderson, resurfaces after serving a long sentence in prison, he is eager to prove that he deserves his shot in the ring. The face-off between former friends is more than just a fight. To settle the score, Adonis must put his future on the line to battle Damien — a fighter who has nothing to lose.
## 2                                                                                                                                                                                            Set more than a decade after the events of the first film, learn the story of the Sully family (Jake, Neytiri, and their kids), the trouble that follows them, the lengths they go to keep each other safe, the battles they fight to stay alive, and the tragedies they endure.
## 3                                                                                                                                                                                                                While working underground to fix a water main, Brooklyn plumbers—and brothers—Mario and Luigi are transported down a mysterious pipe and wander into a magical new world. But when the brothers are separated, Mario embarks on an epic quest to find Luigi.
## 4                                                                                                                                                                                                                                      Through a series of unfortunate events, three mummies end up in present-day London and embark on a wacky and hilarious journey in search of an old ring belonging to the Royal Family, stolen by ambitious archaeologist Lord Carnaby.
## 5                                                         Good-hearted teenager William always lived in hope of following in his late father’s footsteps and becoming a storm chaser. His father’s legacy has now been turned into a storm-chasing tourist business, managed by the greedy and reckless Zane Rogers, who is now using William as the main attraction to lead a group of unsuspecting adventurers deep into the eye of the most dangerous supercell ever seen.
## 6                                                                                                                                                                                                                                                            Inspired by a true story, an oddball group of cops, criminals, tourists and teens converge in a Georgia forest where a 500-pound black bear goes on a murderous rampage after unintentionally ingesting cocaine.
## 7                                                                                                                                                                                                          With the price on his head ever increasing, John Wick uncovers a path to defeating The High Table. But before he can earn his freedom, Wick must face off against a new enemy with powerful alliances across the globe and forces that turn old friends into foes.
## 8                                                                                                                                                                                                                  Puss in Boots discovers that his passion for adventure has taken its toll: He has burned through eight of his nine lives, leaving him with only one life left. Puss sets out on an epic journey to find the mythical Last Wish and restore his nine lives.
## 9                                                                                                                                                                           As viable water is depleted on Earth, a mission is sent to Saturn's moon Titan to retrieve sustainable H2O reserves from its alien inhabitants. But just as the humans acquire the precious resource, they are attacked by Titan rebels, who don't trust that the Earthlings will leave in peace.
## 10                                                                                                                                                                                                                    A dystopian coming-of-age movie focused on three kids who find themselves in an abandoned amusement park, aiming to unite whoever remains. With dangers lurking around every corner, they will do whatever it takes to survive their hellish Neverland.
##                                                                                                                                                                                                                                                                                                                  crew
## 1              Michael B. Jordan, Adonis Creed, Tessa Thompson, Bianca Taylor, Jonathan Majors, Damien Anderson, Wood Harris, Tony 'Little Duke' Evers, Phylicia Rashād, Mary Anne Creed, Mila Davis-Kent, Amara Creed, Florian Munteanu, Viktor Drago, José Benavidez Jr., Felix Chavez, Selenis Leyva, Laura Chavez
## 2                                        Sam Worthington, Jake Sully, Zoe Saldaña, Neytiri, Sigourney Weaver, Kiri / Dr. Grace Augustine, Stephen Lang, Colonel Miles Quaritch, Kate Winslet, Ronal, Cliff Curtis, Tonowari, Joel David Moore, Norm Spellman, CCH Pounder, Mo'at, Edie Falco, General Frances Ardmore
## 3     Chris Pratt, Mario (voice), Anya Taylor-Joy, Princess Peach (voice), Charlie Day, Luigi (voice), Jack Black, Bowser (voice), Keegan-Michael Key, Toad (voice), Seth Rogen, Donkey Kong (voice), Fred Armisen, Cranky Kong (voice), Kevin Michael Richardson, Kamek (voice), Sebastian Maniscalco, Spike (voice)
## 4        Óscar Barberán, Thut (voice), Ana Esther Alborg, Nefer (voice), Luis Pérez Reina, Carnaby (voice), María Luisa Solá, Madre (voice), Jaume Solà, Sekhem (voice), José Luis Mediavilla, Ed (voice), José Javier Serrano Rodríguez, Danny (voice), Aleix Estadella, Dennis (voice), María Moscardó, Usi (voice)
## 5                                                                    Skeet Ulrich, Roy Cameron, Anne Heche, Dr Quinn Brody, Daniel Diemer, William Brody, Jordan Kristine Seamón, Harper Hunter, Alec Baldwin, Zane Rogers, Richard Gunn, Bill Brody, Praya Lundberg, Amy, Johnny Wactor, Martin, Anjul Nigam, Ramesh
## 6                                                                          Keri Russell, Sari, Alden Ehrenreich, Eddie, O'Shea Jackson Jr., Daveed, Ray Liotta, Syd, Kristofer Hivju, Olaf (Kristoffer), Margo Martindale, Ranger Liz, Christian Convery, Henry, Isiah Whitlock Jr., Bob, Jesse Tyler Ferguson, Peter
## 7                                                                    Keanu Reeves, John Wick, Donnie Yen, Caine, Bill Skarsgård, Marquis de Gramont, Ian McShane, Winston, Laurence Fishburne, Bowery King, Lance Reddick, Charon, Clancy Brown, The Harbinger, Hiroyuki Sanada, Shimazu, Shamier Anderson, Mr Nobody
## 8  Antonio Banderas, Puss in Boots (voice), Salma Hayek, Kitty Softpaws (voice), Harvey Guillén, Perrito (voice), Wagner Moura, Wolf (voice), Florence Pugh, Goldilocks (voice), Olivia Colman, Mama Bear (voice), Ray Winstone, Papa Bear (voice), Samson Kayo, Baby Bear (voice), John Mulaney, Jack Horner (voice)
## 9                                                                Paul Bianchi, Computer (voice), Erin Coker, Allison Quince, Jack Pearson, Max Reece, Anthony Jensen, Jowers, Neli Sabour, Heidi Quince, Karan Sagoo, Adrian Naidu, Natalie Storrs, Saoirse Parker, Justin Tanks, Mark Morales, Jenny Tran, Kim Costa
## 10                                                                                     Chloe Guidry, Ines, Nhedrick Jabier, Bui, Carmina Garay, Kuan, Billy Slaughter, Martin Parker, Carli McIntyre, Rue, Laura Coover, Reporter, Presley Richardson, Bennett, Sean Papajohn, Jack, Legend Jay Jones, Slingshot Gang
##                      orig_title   status          orig_lang  budget_x
## 1                     Creed III Released            English  75000000
## 2      Avatar: The Way of Water Released            English 460000000
## 3   The Super Mario Bros. Movie Released            English 100000000
## 4                        Momias Released Spanish, Castilian  12300000
## 5                     Supercell Released            English  77000000
## 6                  Cocaine Bear Released            English  35000000
## 7          John Wick: Chapter 4 Released            English 100000000
## 8  Puss in Boots: The Last Wish Released            English  90000000
## 9               Attack on Titan Released            English  71000000
## 10                     The Park Released            English 119200000
##       revenue country
## 1   271616668      AU
## 2  2316794914      AU
## 3   724459031      AU
## 4    34200000      AU
## 5   340941959      US
## 6    80000000      AU
## 7   351349364      AU
## 8   483480577      AU
## 9   254946484      US
## 10  488962491      US
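
The date conversion above swaps "/" for "-" before parsing. As a minimal sketch, base R's as.Date() can also parse the slash-separated form directly. This assumes the raw date_x values are month/day/year strings such as "03/02/2023"; the sample strings below are inferred from the first three parsed dates in the output and are not read from the CSV.

```r
# Assumed raw values in month/day/year form (inferred, not read from the file).
raw_dates <- c("03/02/2023", "12/15/2022", "04/05/2023")

# Parse directly with a slash-based format string; no gsub() step is needed.
parsed_dates <- as.Date(raw_dates, format = "%m/%d/%Y")

print(parsed_dates)
print(class(parsed_dates))
```

Either route should give the same Date column; the direct format string simply removes one pass over the data.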

Revenue is the response variable. Genre is the categorical variable that I believe influences revenue.

H0: Genre has no effect on revenue.
Ha: Genre has an effect on revenue.

# If there are more than 10 distinct genre values, consolidate them
if (length(unique(movieData$genre)) > 10) {
  # Keep the 10 most frequent values and group the rest into an "Other" category.
  # Each full genre string (e.g. "Drama, Action") counts as one value.
  subcat_counts <- table(movieData$genre)
  small_subcats <- names(subcat_counts)[order(subcat_counts)][1:(length(subcat_counts)-10)]
  movieData$genre[movieData$genre %in% small_subcats] <- 'Other'
}
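
To make the consolidation step easier to follow, here is a small sketch of the same idea on a made-up vector. The genre values and the keep_n cutoff are hypothetical and chosen only so the result is easy to read; the report itself keeps 10 categories.

```r
# Toy genre vector (made-up values, not taken from the dataset).
genre <- c(rep("Drama", 5), rep("Comedy", 4), rep("Action", 3),
           rep("Horror", 2), "Western")
keep_n <- 3  # number of categories to keep; the report keeps 10

counts <- table(genre)
if (length(counts) > keep_n) {
  # Names of the least frequent categories, i.e. everything outside the top keep_n.
  small <- names(counts)[order(counts)][seq_len(length(counts) - keep_n)]
  genre[genre %in% small] <- "Other"
}

print(table(genre))
```

After lumping, the factor passed to aov() has keep_n + 1 levels, which is why the ANOVA in the report has 10 degrees of freedom for genre.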

result <- aov(revenue ~ genre, data = movieData)
anova_summary <- summary(result)
print(anova_summary)
##                Df    Sum Sq   Mean Sq F value Pr(>F)    
## genre          10 1.626e+19 1.626e+18   21.67 <2e-16 ***
## Residuals   10082 7.565e+20 7.503e+16                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 85 observations deleted due to missingness
significance_level <- 0.05
p_value <- anova_summary[[1]]$'Pr(>F)'[1]

if (p_value < significance_level) {
  print("Reject the null hypothesis: revenue differs among the genre.")
} else {
  print("Do not reject the null hypothesis: There's no significant difference in revenue among the genre.")
}
## [1] "Reject the null hypothesis: revenue differs among the genre."
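
As a quick check on the ANOVA table above, the F value is the ratio of the two mean squares, and the p-value is the upper tail of the F distribution with the table's degrees of freedom. The sketch below re-enters the rounded numbers from the printed table by hand, so it agrees with the output only up to rounding.

```r
# Values copied from the printed ANOVA table (rounded as displayed).
ms_genre    <- 1.626e18   # Mean Sq for genre
ms_residual <- 7.503e16   # Mean Sq for residuals
df_genre    <- 10
df_residual <- 10082

# F is the between-group mean square divided by the residual mean square.
f_check <- ms_genre / ms_residual

# The upper-tail probability of the F distribution gives the p-value.
p_check <- pf(f_check, df_genre, df_residual, lower.tail = FALSE)

cat("F value:", round(f_check, 2), "\n")
cat("p-value below 2e-16:", p_check < 2e-16, "\n")
```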

Building and checking the fit of the model

# Building the linear regression model
model <- lm(revenue ~ budget_x, data=movieData)

# evaluate the fit
model_summary <-summary(model)
print(model_summary)
## 
## Call:
## lm(formula = revenue ~ budget_x, data = movieData)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.155e+09 -9.555e+07 -4.019e+07  8.152e+07  2.106e+09 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.036e+07  3.081e+06   13.10   <2e-16 ***
## budget_x    3.280e+00  3.565e-02   91.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 205300000 on 10176 degrees of freedom
## Multiple R-squared:  0.454,  Adjusted R-squared:  0.454 
## F-statistic:  8463 on 1 and 10176 DF,  p-value: < 2.2e-16

Hypothesis Test 1: H0: The intercept is 0. HA: The intercept is not 0.

Result: The t-value is 13.10, and the p-value is < 2e-16. Since the p-value is far below 0.05, we reject the null hypothesis. The intercept is statistically different from zero.

Hypothesis Test 2: H0: The coefficient of budget_x is 0. HA: The coefficient of budget_x is not 0.

Result: The t-value is 91.99, and the p-value is < 2e-16. Since the p-value is far below 0.05, we reject the null hypothesis. The coefficient for budget_x is statistically significant, meaning budget_x has a significant linear relationship with revenue.

Overall model fit hypothesis test: H0: All regression coefficients (except the intercept) are equal to zero. HA: At least one coefficient is not equal to zero.

Result: The F-statistic is 8463 on 1 and 10176 degrees of freedom, with a p-value < 2.2e-16. Given this very small p-value, we reject the null hypothesis. This implies that at least one predictor, in this case budget_x, is useful in predicting revenue.
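
The test statistics in the summary output can be reproduced from the coefficient table. Each t value is the estimate divided by its standard error, and with a single predictor the overall F-statistic is the square of the slope's t value. The sketch below types in the unrounded estimates and standard errors that model_summary$coefficients prints in this report; it does not refit the model.

```r
# Unrounded estimates and standard errors as printed by model_summary$coefficients.
est <- c(intercept = 4.035581e+07, budget_x = 3.279539e+00)
se  <- c(intercept = 3.080535e+06, budget_x = 3.564939e-02)

# Each t value is the estimate divided by its standard error.
t_values <- est / se

# Two-sided p-values from the t distribution with the residual degrees of freedom.
df_residual <- 10176
p_values <- 2 * pt(abs(t_values), df = df_residual, lower.tail = FALSE)

# With a single predictor, the overall F-statistic equals the squared slope t value.
f_stat <- unname(t_values["budget_x"]^2)

print(round(t_values, 2))
print(p_values)
cat("F-statistic:", round(f_stat), "\n")
```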

# Extract coefficients
coefficients <- model_summary$coefficients
print(coefficients)
##                 Estimate   Std. Error  t value     Pr(>|t|)
## (Intercept) 4.035581e+07 3.080535e+06 13.10026 6.767445e-39
## budget_x    3.279539e+00 3.564939e-02 91.99425 0.000000e+00
# Extract R-squared
r_squared <- model_summary$r.squared

# Extract Adjusted R-squared
adj_r_squared <- model_summary$adj.r.squared

# Extract Residual Standard Error
residual_se <- model_summary$sigma

# Print extracted details
cat("R-squared:", r_squared, "\n")
## R-squared: 0.4540463
cat("Adjusted R-squared:", adj_r_squared, "\n")
## Adjusted R-squared: 0.4539926
cat("Residual Standard Error:", residual_se, "\n")
## Residual Standard Error: 205264009
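
The adjusted R-squared printed above follows from the plain R-squared and the degrees of freedom. As a sketch, the numbers below are copied from the output by hand; the variable names are illustrative only.

```r
# Values taken from the printed summary.
r_squared    <- 0.4540463
df_residual  <- 10176   # residual degrees of freedom
n_predictors <- 1       # budget_x only

# Number of observations used in the fit: residual df + predictors + intercept.
n_obs <- df_residual + n_predictors + 1

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r_squared_check <- 1 - (1 - r_squared) * (n_obs - 1) / df_residual

cat("Observations used:", n_obs, "\n")
cat("Adjusted R-squared:", adj_r_squared_check, "\n")
```

With over ten thousand observations and one predictor, the penalty is tiny, which is why the two values differ only in the fifth decimal place.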

Interpretation of Coefficients: . Intercept (4.036e+07): The intercept is the predicted revenue when budget_x = 0, about 40.4 million. A film with a budget of zero is not a realistic case, so the intercept is best read as the baseline that anchors the regression line rather than as a practical prediction. It is statistically significant (p < 2e-16).

. budget_x (3.280): The coefficient for budget_x means that each additional unit of budget is associated with an average increase of about 3.28 units of revenue. If both columns are recorded in dollars, each additional dollar of budget goes with roughly $3.28 of additional revenue.
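
To see what the fitted line implies in practice, the sketch below plugs a few budgets into the equation revenue = intercept + slope * budget_x, using the coefficients from the printed model summary. The budgets are hypothetical values chosen only for illustration.

```r
# Coefficients copied from the printed model summary.
intercept <- 4.035581e+07
slope     <- 3.279539

# Hypothetical budgets chosen only for illustration.
budgets <- c(10e6, 50e6, 100e6)

# Predicted revenue on the fitted line: intercept + slope * budget.
predicted_revenue <- intercept + slope * budgets

print(data.frame(budget = budgets, predicted_revenue = predicted_revenue))
```

These are point predictions only. The residual standard error in the summary is about 205 million, so the revenue of any individual film can sit far from the line.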

Recommendations & Interpretations: Profitability: The response variable is revenue, not profit. An additional dollar of budget is associated with about $3.28 of additional revenue, so the profit on that dollar is the extra revenue minus the associated costs, including the budget itself. Knowing the full costs would give a clearer picture of profitability.

Budget Planning: If revenue rises consistently with budget, larger budgets may be worth considering. The model describes an average association across films, though, not a guarantee for any single film, and with an R-squared of 0.454 more than half of the variation in revenue is not explained by budget.

Diagnostic plots for the model

par(mfrow=c(2,2))
plot(model)
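
The four default panels drawn by plot() on an lm object are built from a handful of per-observation quantities. As a sketch on simulated data (not the movie dataset), the block below extracts those quantities directly so they can be inspected as numbers; the variable names and simulation settings are assumptions made for the example.

```r
set.seed(1)

# Simulated data standing in for budget and revenue (not the movie dataset).
x <- runif(200, min = 0, max = 100)
y <- 5 + 3 * x + rnorm(200, sd = 20)
fit <- lm(y ~ x)

# Quantities behind the four default panels drawn by plot(fit).
diagnostics <- data.frame(
  fitted       = fitted(fit),     # x-axis of Residuals vs Fitted and Scale-Location
  residual     = resid(fit),      # y-axis of Residuals vs Fitted
  std_residual = rstandard(fit),  # used by Normal Q-Q, Scale-Location, Residuals vs Leverage
  leverage     = hatvalues(fit)   # x-axis of Residuals vs Leverage
)

print(head(diagnostics))
```

A funnel shape in residual against fitted, or a few rows with much larger leverage than the rest, are the patterns the plots are meant to expose.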

Including one more variable in the model to see how the model changes. Note that the formula below lists budget_x twice instead of introducing a new predictor. R drops the duplicated term, so model2 is the same model as the first one.

# Building the linear regression model
model2 <- lm(revenue ~ budget_x + budget_x, data=movieData)

# evaluate the fit
model_summary <- summary(model2)
print(model_summary)
## 
## Call:
## lm(formula = revenue ~ budget_x + budget_x, data = movieData)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.155e+09 -9.555e+07 -4.019e+07  8.152e+07  2.106e+09 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.036e+07  3.081e+06   13.10   <2e-16 ***
## budget_x    3.280e+00  3.565e-02   91.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 205300000 on 10176 degrees of freedom
## Multiple R-squared:  0.454,  Adjusted R-squared:  0.454 
## F-statistic:  8463 on 1 and 10176 DF,  p-value: < 2.2e-16
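
The Call line in the output above shows budget_x on both sides of the plus sign. As a sketch on simulated data (not the movie dataset), the block below shows that R collapses a repeated term when it expands a formula, so the fit with the duplicated term has exactly the same coefficients as the fit with the single term.

```r
set.seed(42)

# Simulated data (not the movie dataset).
x <- rnorm(100)
y <- 2 + 1.5 * x + rnorm(100)

# A repeated term is dropped when the formula is expanded into model terms.
term_labels <- attr(terms(y ~ x + x), "term.labels")
print(term_labels)

# So both fits estimate the same two coefficients.
fit_single    <- lm(y ~ x)
fit_duplicate <- lm(y ~ x + x)
print(all.equal(coef(fit_single), coef(fit_duplicate)))
```

To actually extend the model, the second term would need to be a different column of the data.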

Changes after including the new variable: None. Because the second term in the formula is budget_x again, model2 has the same single predictor as the first model. The intercept (4.036e+07) and the coefficient for budget_x (3.280) are unchanged, and both remain statistically significant (p < 2e-16).

For every unit increase in budget_x, revenue increases by about 3.28 units. The relationship between budget_x and revenue is therefore positive, exactly as in the first model.

The explanatory power of the model is unchanged: R-squared and adjusted R-squared are both still 0.454.

The F-statistic is also unchanged at 8463 on 1 and 10176 degrees of freedom, since no predictor was added. The important aspect is that the p-value remains extremely low (< 2.2e-16), indicating that