# Load libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
# Load data
movies <- read.csv("C:/Users/Prasad/Downloads/imdb.csv")
# Fit linear model
fit <- lm(revenue ~ budget_x + score, data = movies)
#This model allows us to estimate the relationship between budget and score (as continuous predictor variables) and box office revenue (as the response variable).
#The model summary shows information about the estimated coefficients, standard errors, t-values, and p-values. This gives us a sense of which variables are statistically significant predictors in the model.
#We could consider transforming the response variable or predictors to try to improve model fit. We may also want to include other explanatory variables like genre, cast, release date, etc. But this provides a reasonable starting linear model using the movie data.
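The transformation idea above can be sketched as follows. This is only an illustration under stated assumptions: the data frame `sim` is simulated (it is not the real movie data), revenue is generated multiplicatively so that a log scale is appropriate, and the names `fit_raw` and `fit_log` are hypothetical.

```r
# Simulated stand-in for the movie data (hypothetical values, not imdb.csv)
set.seed(42)
n <- 200
sim <- data.frame(
  budget_x = runif(n, 1e6, 2e8),
  score    = runif(n, 30, 90)
)
# Assumption: revenue depends multiplicatively on budget, with log-normal noise
sim$revenue <- exp(1 + 0.9 * log(sim$budget_x) + 0.01 * sim$score +
                     rnorm(n, sd = 0.5))

# Model on the original scale, and a log-log alternative
fit_raw <- lm(revenue ~ budget_x + score, data = sim)
fit_log <- lm(log(revenue) ~ log(budget_x) + score, data = sim)

# Estimated coefficients of the transformed model
coef(fit_log)

# Diagnostic plots for the transformed model, to compare with the originals
par(mfrow = c(2, 2))
plot(fit_log)
```

R-squared values from the two models are measured on different response scales, so the diagnostic plots are a safer basis for comparison than R-squared alone.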
# Model diagnostics
par(mfrow = c(2, 2))
plot(fit)
summary(fit)
##
## Call:
## lm(formula = revenue ~ budget_x + score, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -849646874 -148107537 -35799539 123828672 1942070819
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.133e+08 8.114e+07 -6.326 6.24e-10 ***
## budget_x 3.706e+00 1.907e-01 19.428 < 2e-16 ***
## score 8.114e+06 1.111e+06 7.306 1.32e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 283300000 on 436 degrees of freedom
## Multiple R-squared: 0.4814, Adjusted R-squared: 0.479
## F-statistic: 202.3 on 2 and 436 DF, p-value: < 2.2e-16
#The residuals vs fitted plot shows slight curvature: the residuals are not centered on zero across the whole range of fitted values. This suggests some non-linearity in the relationship between the predictors and response that the linear model does not fully capture.
#The Normal Q-Q plot shows some deviation from the straight line, indicating the residuals are not perfectly normally distributed. The residuals exhibit some positive skew.
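The visual reading of the Q-Q plot can be backed up with a numeric check. The sketch below uses simulated data with deliberately right-skewed errors (an assumption made for illustration, not a property taken from the movie data), then applies the Shapiro-Wilk test from base R and computes the sample skewness of the residuals by hand. The names `sim`, `fit_sim`, `res`, `sw`, and `skew` are hypothetical.

```r
# Simulated data with right-skewed errors (hypothetical, not imdb.csv)
set.seed(1)
n <- 300
sim <- data.frame(
  budget_x = runif(n, 1e6, 2e8),
  score    = runif(n, 30, 90)
)
# Centered exponential noise has mean zero and positive skew
noise <- rexp(n, rate = 1 / 5e7) - 5e7
sim$revenue <- 2 * sim$budget_x + 1e6 * sim$score + noise

fit_sim <- lm(revenue ~ budget_x + score, data = sim)
res <- resid(fit_sim)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(res)
sw$p.value

# Sample skewness; values above zero indicate a longer right tail
skew <- mean((res - mean(res))^3) / sd(res)^3
skew
```

A small p-value together with positive skewness would support the same conclusion as the Q-Q plot. Note that `shapiro.test()` only accepts between 3 and 5000 observations.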
# Check assumptions and interpret the coefficients:
ggplot(fit, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
# Create a ggplot object and specify the data
p <- ggplot(data = movies, aes(x = budget_x))
# Add a density plot
p + geom_density()
# Interpretation
summary(fit)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.132965e+08 8.114066e+07 -6.326008 6.244670e-10
## budget_x 3.705702e+00 1.907364e-01 19.428398 5.034949e-61
## score 8.114215e+06 1.110626e+06 7.305980 1.321389e-12
#The budget_x coefficient is about 3.71: each additional unit of budget is associated with about 3.71 more units of expected revenue on average, holding score constant. If both variables are recorded in dollars, a $1 million increase in budget corresponds to roughly $3.7 million more expected revenue.
#The score coefficient is about 8.11 million: a one-point increase in score is associated with roughly 8.1 million more units of expected revenue, holding budget constant.
#Both predictors are therefore positively associated with revenue, and both effects are statistically significant (p < 0.001 for each coefficient in the summary above).
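Confidence intervals are one way to assess the statistical significance of these effects alongside the p-values. The sketch below is a minimal example on simulated data; the generating coefficients (3.7 and 8e6) are chosen only to resemble the magnitudes in the output above, and `sim`, `fit_sim`, `ci`, and `excludes_zero` are hypothetical names.

```r
# Simulated data (hypothetical, not imdb.csv)
set.seed(7)
n <- 200
sim <- data.frame(
  budget_x = runif(n, 1e6, 2e8),
  score    = runif(n, 30, 90)
)
sim$revenue <- 3.7 * sim$budget_x + 8e6 * sim$score + rnorm(n, sd = 1e8)

fit_sim <- lm(revenue ~ budget_x + score, data = sim)

# 95% confidence intervals for the intercept and both slopes
ci <- confint(fit_sim, level = 0.95)
ci

# A coefficient differs from zero at the 5% level when its interval excludes zero
excludes_zero <- ci[, 1] > 0 | ci[, 2] < 0
excludes_zero
```

On the real model, `confint(fit)` would give the corresponding intervals for `budget_x` and `score`.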
#Conclusion
#In this analysis, I built a linear model to predict box office revenue from budget and score. The model explains about 48% of the variance in revenue (multiple R-squared = 0.4814). The model diagnostics revealed some departures from the linearity and normality assumptions that might be reduced by transforming the response variable. The coefficients indicate positive, statistically significant associations between both predictors and revenue. Further analysis is needed to fully understand the nuances of these relationships and potentially develop a more accurate predictive model.