Loading Packages and Data

library(graphics)
library(ggplot2)
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.6.3
## -- Attaching packages ------------------------------------------------ tidyverse 1.3.0 --
## v tibble  2.1.3     v dplyr   0.8.4
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## v purrr   0.3.3
## Warning: package 'tidyr' was built under R version 3.6.3
## Warning: package 'dplyr' was built under R version 3.6.3
## Warning: package 'stringr' was built under R version 3.6.3
## Warning: package 'forcats' was built under R version 3.6.3
## -- Conflicts --------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(knitr)
## Warning: package 'knitr' was built under R version 3.6.3
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.6.3
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 3.6.3
## Loading required package: magrittr
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
library(rstatix)
## Warning: package 'rstatix' was built under R version 3.6.3
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
## 
##     filter
library(broom)
## Warning: package 'broom' was built under R version 3.6.3
library(lattice)

setwd("C:/Users/Daivik/Desktop/EDA/Assignments/Assignment 4")
data <- read.delim("movie_budgets.txt", header = TRUE, sep = " ", dec = ".")

Fitting the Model

data$log.budget <- log(data$budget) 
fit.model <- loess(data$log.budget ~ data$year, data = data, family="symmetric", method="loess", span = 0.75, degree = 2)
ypred<-predict(fit.model)
summary(fit.model)
## Call:
## loess(formula = data$log.budget ~ data$year, data = data, span = 0.75, 
##     degree = 2, family = "symmetric", method = "loess")
## 
## Number of Observations: 5183 
## Equivalent Number of Parameters: 4.83 
## Residual Scale Estimate: 2.631 
## Trace of smoother matrix: 5.27  (exact)
## 
## Control settings:
##   span     :  0.75 
##   degree   :  2 
##   family   :  symmetric      iterations = 4
##   surface  :  interpolate      cell = 0.2
##   normalize:  TRUE
##  parametric:  FALSE
## drop.square:  FALSE
ss.budget <- sum(scale(data$log.budget, scale=FALSE)^2)
ss.resid <- sum(resid(fit.model)^2)
sprintf("R-Squared of the Fit Loess Line is: ")
## [1] "R-Squared of the Fit Loess Line is: "
1-ss.resid/ss.budget
## [1] 0.1113977

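Raw budget has very high variance, so we model log(budget) instead: the log transformation compresses the largest budgets and gives better predictions. Even so, the R-squared of about 0.11 shows that year alone explains only a small part of the variation in log(budget).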
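The R-squared calculation above can be wrapped in a small reusable function. The sketch below is illustrative only: `pseudo_r2` is a hypothetical helper name, and the data are simulated rather than taken from movie_budgets.txt.

```r
# Hypothetical helper (not part of the original analysis):
# 1 - residual sum of squares / total sum of squares.
pseudo_r2 <- function(fit, y) {
  ss_total <- sum((y - mean(y))^2)  # same quantity as sum(scale(y, scale = FALSE)^2)
  ss_resid <- sum(resid(fit)^2)
  1 - ss_resid / ss_total
}

# Simulated data, for illustration only
set.seed(1)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.3)

fit <- loess(y ~ x, span = 0.3, degree = 2)
r2 <- pseudo_r2(fit, y)
print(round(r2, 3))
```

Because loess is not a least-squares linear model, this value is best read as a descriptive summary of how much spread the curve removes, not as a formal coefficient of determination.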

Budget vs Year

data %>%
  mutate(loess = predict(loess(log.budget ~ year, data = data,
                               family = "symmetric", method = "loess",
                               span = 0.75, degree = 2))) %>%
  ggplot(aes(year, log.budget)) +
  geom_point(color = "red") +
  geom_line(aes(y = loess), size = 1)

No strong linear relationship can be observed, so we fit a curved model instead of a straight line. A linear model would miss how log(budget) changes with year in this data, which a curved fit can capture.
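One way to check whether a curve is worth it is to compare residual sums of squares for a straight line and a loess curve. This is a minimal sketch on simulated data with a deliberately curved trend; the variable names `year` and `log_budget` here are stand-ins, not the columns of the real dataset.

```r
# Simulated data with a curved trend, for illustration only
set.seed(42)
year <- runif(300, 1950, 2020)
log_budget <- 10 + 0.002 * (year - 1950)^2 + rnorm(300, sd = 1)

lin_fit   <- lm(log_budget ~ year)                            # straight line
curve_fit <- loess(log_budget ~ year, span = 0.75, degree = 2) # local quadratic

rss_lin   <- sum(resid(lin_fit)^2)
rss_curve <- sum(resid(curve_fit)^2)
print(c(linear = rss_lin, loess = rss_curve))
```

A clearly smaller residual sum of squares for the loess fit suggests that the straight line is missing curvature. A more flexible fit always reduces the residuals somewhat, so small differences are not evidence of non-linearity.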

Budget vs Length

data %>%
  mutate(loess = predict(loess(log.budget ~ length, data = data,
                               family = "symmetric", method = "loess",
                               span = 0.3, degree = 2))) %>%
  ggplot(aes(length, log.budget)) +
  geom_point(color = "red") +
  geom_line(aes(y = loess), size = 1)

For earlier years, a longer movie tends to have a higher log(budget) predicted by our model. As the year increases, this trend seems to disappear: a longer movie does not necessarily have a higher log(budget) than a shorter one.

Budget vs log(Length)

data %>%
  mutate(loess = predict(loess(log.budget ~ year + log(length), data = data,
                               family = "symmetric", method = "loess",
                               span = 0.95555, degree = 0))) %>%
  ggplot(aes(year, log.budget)) +
  geom_point(color = "red") +
  geom_line(aes(y = loess), size = 1)
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : Chernobyl! trL<k 1.396

Conclusions

-> Both linear and curved lines seem to fit year, since log(budget) progresses fairly evenly with it. A curved line is the better fit here because there are some influential points towards the higher end of the year variable.

-> A curved fit is better since we can observe non-linearity. The extreme points might influence the fitted line, which a more complex regression can handle.

-> No, we don't require an interaction between these variables to get a good fit. They are not linearly dependent, so fitting the effect of a linear combination of the variables is of no use.

-> The span varies with the smoothness required for the plot. Time: the fitted loess curve is smooth and seems to fit the data well, so little smoothing was required. Length: length has outliers relative to time. Time + Length: fitted to capture the effect of a linear combination of time and length on log(budget). In the initial plot the fitted line had spikes and dips, so it required a lot of smoothing.

-> Influential data points matter. The fitted line needs to be robust to influential points that could otherwise distort the model, so a robust fit using an M-estimate (family = "symmetric") is the more reliable choice.
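The last two points can be illustrated with a small sketch. It uses simulated data with three planted outliers (not the movie data) to compare a least-squares loess fit (family = "gaussian") with the robust fit (family = "symmetric"), and to show that a larger span gives a smoother fit with fewer equivalent parameters.

```r
# Simulated linear trend with three gross outliers, for illustration only
set.seed(7)
x <- 1:100
truth <- 0.05 * x
y <- truth + rnorm(100, sd = 0.2)
outliers <- c(45, 50, 55)
y[outliers] <- y[outliers] + 15

ls_fit     <- loess(y ~ x, span = 0.3, degree = 2, family = "gaussian")
robust_fit <- loess(y ~ x, span = 0.3, degree = 2, family = "symmetric")

# Distance of each fitted curve from the true trend at x = 50
err_ls     <- unname(abs(predict(ls_fit)[50] - truth[50]))
err_robust <- unname(abs(predict(robust_fit)[50] - truth[50]))
print(c(least_squares = err_ls, robust = err_robust))

# A larger span smooths more, which lowers the equivalent number of parameters
wide_fit <- loess(y ~ x, span = 0.9, degree = 2, family = "symmetric")
print(c(span_0.3 = robust_fit$enp, span_0.9 = wide_fit$enp))
```

The robust fit down-weights points with large residuals over a few iterations, so the planted outliers pull the curve far less than they do in the least-squares fit.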

Budget across Time

data$length.cat <- cut(data$length, breaks = 3, labels = c("Short", "Medium","Long"))

# Scatter plot with a linear trend line; faceted by length category below
p <- ggplot(data, aes(x = year, y = log.budget)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  ggtitle("Comparison of budget across time between movie lengths") +
  xlab("year") +
  ylab("log.budget") +
  theme_bw()

p + facet_wrap(~length.cat)
## `geom_smooth()` using formula 'y ~ x'

ggplot(data, aes(x = year, y = length, fill = length)) + geom_raster() + scale_fill_distiller(palette = "RdYlBu") + coord_fixed() + geom_contour(aes(z = length))
## Warning: Computation failed in `stat_contour()`:
## Number of x coordinates must match number of columns in density matrix.
## Warning: Raster pixels are placed at uneven horizontal intervals and will be
## shifted. Consider using geom_tile() instead.
## Warning: Raster pixels are placed at uneven vertical intervals and will be
## shifted. Consider using geom_tile() instead.

There are very few long movies in the dataset compared to short movies. Short movies in every year are spread across both high and low budgets. Budget increases over the years for all three length categories.
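Part of the imbalance between categories comes from how `cut(x, breaks = 3)` works: it splits the range of the variable into three equal-width intervals, so a skewed variable leaves few observations in the top bin. The sketch below uses simulated lengths (an assumption, not the real data) and shows quantile-based breaks as one possible alternative; it is not what the analysis above does.

```r
# Simulated, right-skewed movie lengths in minutes, for illustration only
set.seed(3)
len <- round(rlnorm(1000, meanlog = log(100), sdlog = 0.3))
len_labels <- c("Short", "Medium", "Long")

# Equal-width bins, as used in the analysis above
equal_width <- cut(len, breaks = 3, labels = len_labels)

# Alternative: bins with roughly equal counts, using tertiles as breaks
cuts <- quantile(len, probs = c(0, 1/3, 2/3, 1))
equal_count <- cut(len, breaks = cuts, labels = len_labels,
                   include.lowest = TRUE)

print(table(equal_width))
print(table(equal_count))
```

With quantile breaks each facet has enough points to support its own trend line, although "Short", "Medium" and "Long" then describe relative rank and not fixed running times.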