The critic score earned by a movie depends on various factors such as actors, directors, plot, revenue, etc. This report examines these factors, highlights the important ones, and develops a regression model based on the imdb_db dataset to predict the average critic score earned by a movie.
The dataset used in this project comes from the Kaggle.com open datasets: IMDB New Dataset. It was originally scraped with BeautifulSoup4 from the IMDb search pages.
Source: Kaggle.com, 2021. IMDB New Dataset. [online] Available at: https://www.kaggle.com/wrandrall/imdb-new-dataset?select=imdb_db [Accessed 4 June 2021].
The dataset has 189,900 rows and 14 columns. Each observation represents a movie. The movies are ranked by number of votes from high to low and cover the years 1900-2020.
There are 6 numerical and 8 categorical variables in the dataset, listed below.
The target feature is:
Score: the mean score between 1 and 10 allotted to the movie by journalists/critics.
The remaining features are:
Movie Name: the name of the movie/series.
Movie Date: the year the movie was released.
Serie Name: the name of the series season (if any).
Serie Date: the year the series was released.
Movie Type: the genre(s) of the movie (action, drama, sci-fi, ...).
Number of Votes: the number of people who voted for the Metascore.
Movie Revenue (M$): the box-office revenue made by the movie, in millions of dollars.
Metascore: the mean score between 1 and 100 allotted to the movie/series by viewers.
Time Duration (min): the duration of the movie in minutes.
Director: the list of directors who directed the movie/series.
Actors: the list of main actors who played in the movie/series.
Restriction: the age restriction and warning (all public, all public with warning, 12, 12 with warnings, 16, ...).
Description: a short summary of the movie.
To start with our report, we have to import the necessary R packages:
library(readr)
library(dplyr)
library(tidyr)
library(plotly)
library(car)
After importing the necessary packages, we have to import the IMDb dataset. We read it with the read_csv() function, name it imdb_db, and take a glimpse at it.
imdb_db <- read_csv("imdb_db.csv")
imdb_db
Our dataset has a lot of observations, so we will keep only the top 1500 rows (the 1500 most-voted movies, since the data is sorted by number of votes) and name this subset imdb.
imdb <- imdb_db %>% slice(1:1500)
imdb
Now, let's check for any missing values in our dataset.
colSums(is.na(imdb))
Movie Name Movie Date Serie Name Serie Date
0 0 1500 1500
Movie Type Number of Votes Movie Revenue (M$) Score
0 0 60 0
Metascore Time Duration (min) Director Actors
60 0 0 0
Restriction Description
0 0
As we can see, our dataset has missing values in the columns Serie Name, Serie Date, Metascore and Movie Revenue (M$). Before moving further we have to handle them: we drop the incomplete rows with drop_na() and remove the two series columns with select().
imdb <- drop_na(imdb,`Movie Revenue (M$)`,Metascore)
imdb <- select(imdb,-c(3,4))
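Dropping columns by numeric position works, but it silently breaks if the column order ever changes. An equivalent call by name (a sketch, shown for reference rather than run again here) is more robust:
# equivalent to select(imdb, -c(3,4)), but robust to column reordering
imdb <- select(imdb, -`Serie Name`, -`Serie Date`)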
We have removed the missing values; let's check whether any remain in the imdb dataset.
colSums(is.na(imdb))
Movie Name Movie Date Movie Type Number of Votes
0 0 0 0
Movie Revenue (M$) Score Metascore Time Duration (min)
0 0 0 0
Director Actors Restriction Description
0 0 0 0
As we can see, there are no missing values left in our imdb dataset. Now we will check whether it contains any special values such as NaN or infinite values.
# returns TRUE for Inf/NaN entries of numeric columns, FALSE otherwise
is.special <- function(x){
  if (is.numeric(x)) is.infinite(x) | is.nan(x) else FALSE
}
sapply(imdb, function(x) sum(is.special(x)))
Movie Name Movie Date Movie Type Number of Votes
0 0 0 0
Movie Revenue (M$) Score Metascore Time Duration (min)
0 0 0 0
Director Actors Restriction Description
0 0 0 0
As we can see, our dataset doesn't contain any special values. Now we move on to some descriptive statistics.
glimpse(imdb)
Rows: 1,440
Columns: 12
$ `Movie Name` <chr> "Les évadés", "The Dark Knight: Le chevalier noir", "Inception~
$ `Movie Date` <dbl> 1994, 2008, 2010, 1999, 1994, 1994, 1999, 2001, 2003, 1972, 20~
$ `Movie Type` <chr> "['Drama']", "['Action', 'Crime', 'Drama']", "['Action', 'Adve~
$ `Number of Votes` <dbl> 2294987, 2259829, 2021865, 1819635, 1792272, 1768611, 1643323,~
$ `Movie Revenue (M$)` <dbl> 28341469, 534858444, 292576195, 37030102, 107928762, 330252182~
$ Score <dbl> 9.3, 9.0, 8.8, 8.8, 8.9, 8.8, 8.7, 8.8, 8.9, 9.2, 8.4, 8.6, 8.~
$ Metascore <dbl> 80, 84, 74, 66, 94, 82, 73, 92, 94, 100, 78, 74, 87, 65, 81, 6~
$ `Time Duration (min)` <dbl> 142, 152, 148, 139, 154, 142, 136, 178, 201, 175, 164, 169, 17~
$ Director <chr> "['Frank Darabont']", "['Christopher Nolan']", "['Christopher ~
$ Actors <chr> "['Tim Robbins', 'Morgan Freeman', 'Bob Gunton', 'William Sadl~
$ Restriction <chr> "Tous publics", "Tous publics", "Tous publics", "16", "12", "T~
$ Description <chr> "Two imprisoned men bond over a number of years, finding solac~
class(imdb)
[1] "tbl_df" "tbl" "data.frame"
imdb %>% summary()
Movie Name Movie Date Movie Type Number of Votes Movie Revenue (M$)
Length:1440 Min. :1972 Length:1440 Min. : 933448 Min. : 6719864
Class :character 1st Qu.:1994 Class :character 1st Qu.:1026328 1st Qu.: 91574114
Mode :character Median :2000 Mode :character Median :1166181 Median :167142682
Mean :1999 Mean :1284147 Mean :213513229
3rd Qu.:2006 3rd Qu.:1460896 3rd Qu.:310730244
Max. :2014 Max. :2294987 Max. :760507625
Score Metascore Time Duration (min) Director Actors
Min. :7.800 Min. : 58.00 Min. : 98.0 Length:1440 Length:1440
1st Qu.:8.300 1st Qu.: 68.75 1st Qu.:123.5 Class :character Class :character
Median :8.500 Median : 77.50 Median :142.5 Mode :character Mode :character
Mean :8.504 Mean : 77.90 Mean :146.2
3rd Qu.:8.700 3rd Qu.: 87.00 3rd Qu.:166.0
Max. :9.300 Max. :100.00 Max. :202.0
Restriction Description
Length:1440 Length:1440
Class :character Class :character
Mode :character Mode :character
From the above, we can see that our dataset now has 1440 rows and 12 columns; a summary of the entire dataset is also shown.
Here we explore possible relationships between each regressor and the response variable by plotting graphs and computing correlations.
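Before looking at individual pairs, a compact overview can be useful. The sketch below (an optional step, not part of the original analysis) prints the pairwise Pearson correlations between all numeric columns:
# pairwise Pearson correlations between the numeric variables, rounded
imdb %>%
  select(where(is.numeric)) %>%
  cor() %>%
  round(2)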
The first graph shows the relationship between our response variable Score and the regressor Movie Revenue (M$). There is not a strong, but still a visible, negative linear trend between the regressor and the response variable: higher-revenue movies tend to score slightly lower.
# relation between variables
fig1 <- plot_ly(data = imdb, x = ~Score, y = ~`Movie Revenue (M$)`)
fig1
Let's confirm the relationship between the variables with a correlation test.
cormovie <- cor.test(imdb$Score, imdb$`Movie Revenue (M$)`,
method = "pearson")
cormovie
Pearson's product-moment correlation
data: imdb$Score and imdb$`Movie Revenue (M$)`
t = -14.754, df = 1438, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.4066409 -0.3168782
sample estimates:
cor
-0.3626002
From the correlation test we see that:
t is -14.754 (the t-test statistic).
df is 1438 (degrees of freedom).
p-value is the significance level of the t-test (p-value < 2.2e-16).
conf.int is the confidence interval of the correlation coefficient at 95% (conf.int = [-0.4066, -0.3169]).
sample estimates is the correlation coefficient (cor = -0.3626).
So our regressor and response variable show a moderate negative correlation; it is statistically significant but not very strong.
Second, we will show the relationship between the Metascore and Score variables with a scatter plot.
fig2 <- plot_ly(data = imdb, x = ~Score, y = ~Metascore)
fig2
The second graph shows the relationship between our response variable Score and the regressor Metascore. There is not a strong, but still a visible, positive linear trend between the regressor and the response variable.
Let's confirm the relationship between the variables with a correlation test.
cormeta <- cor.test(imdb$Score, imdb$Metascore,
method = "pearson")
cormeta
Pearson's product-moment correlation
data: imdb$Score and imdb$Metascore
t = 18.529, df = 1438, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.3963395 0.4797856
sample estimates:
cor
0.4390088
From the correlation test we see that:
t is 18.529 (the t-test statistic).
df is 1438 (degrees of freedom).
p-value is the significance level of the t-test (p-value < 2.2e-16).
conf.int is the confidence interval of the correlation coefficient at 95% (conf.int = [0.3963, 0.4798]).
sample estimates is the correlation coefficient (cor = 0.4390).
So our regressor and response variable show a moderate positive correlation; it is statistically significant but not very strong.
Now, we will represent the relationship between the Restriction and Score variables by plotting a boxplot.
fig3 <- plot_ly(data = imdb, x = ~Restriction, y = ~Score, type = "box")
fig3 <- fig3 %>% layout(boxmode = "group")
fig3
From the boxplot we can read the typical Score of movies in each age-restriction group: for the age group 12 the median score is about 8.5, for the age group 16 it is about 8.65, and so on.
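To back up this reading of the plot, the per-group medians can be computed directly; the sketch below (not in the original analysis) does so, and its exact values may differ slightly from the ones read off the plot:
# median Score and group size per Restriction category
imdb %>%
  group_by(Restriction) %>%
  summarise(median_score = median(Score), n = n())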
Now we will fit various regression models, test the model assumptions, and transform the data where needed.
First, we fit a multiple linear regression on our imdb dataset.
Model1 <- lm(Score ~ imdb$`Movie Date` + imdb$`Number of Votes` + imdb$`Movie Revenue (M$)` +
               imdb$Metascore + imdb$`Time Duration (min)`, data = imdb)
summary(Model1)
Call:
lm(formula = Score ~ imdb$`Movie Date` + imdb$`Number of Votes` +
imdb$`Movie Revenue (M$)` + imdb$Metascore + imdb$`Time Duration (min)`,
data = imdb)
Residuals:
Min 1Q Median 3Q Max
-0.255330 -0.080766 -0.008307 0.086434 0.198411
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.014e+01 6.231e-01 48.37 <2e-16 ***
imdb$`Movie Date` -1.154e-02 3.087e-04 -37.39 <2e-16 ***
imdb$`Number of Votes` 6.572e-07 8.967e-09 73.29 <2e-16 ***
imdb$`Movie Revenue (M$)` -7.935e-10 1.785e-11 -44.46 <2e-16 ***
imdb$Metascore 7.087e-03 2.906e-04 24.39 <2e-16 ***
imdb$`Time Duration (min)` 1.461e-03 1.111e-04 13.16 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1083 on 1434 degrees of freedom
Multiple R-squared: 0.8943, Adjusted R-squared: 0.8939
F-statistic: 2425 on 5 and 1434 DF, p-value: < 2.2e-16
Here we have fitted a multiple linear regression model with five regressors and named it Model1. The multiple R² is 0.8943 and the adjusted R² is 0.8939. The overall p-value is smaller than 0.05, so the model is significant and we can reject H0 (that all regression coefficients are zero).
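As a supplementary check (not part of the original analysis), 95% confidence intervals for the fitted coefficients can be read straight off the model:
# 95% confidence intervals for the coefficients of Model1
confint(Model1, level = 0.95)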
To check the overall fit, we will use the anova() function.
anova(Model1)
Analysis of Variance Table
Response: Score
Df Sum Sq Mean Sq F value Pr(>F)
imdb$`Movie Date` 1 34.100 34.100 2908.92 < 2.2e-16 ***
imdb$`Number of Votes` 1 81.233 81.233 6929.59 < 2.2e-16 ***
imdb$`Movie Revenue (M$)` 1 16.680 16.680 1422.92 < 2.2e-16 ***
imdb$Metascore 1 8.122 8.122 692.88 < 2.2e-16 ***
imdb$`Time Duration (min)` 1 2.029 2.029 173.10 < 2.2e-16 ***
Residuals 1434 16.810 0.012
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From the ANOVA table, the p-values for all variables are < 2.2e-16, well below 0.05. Therefore all the variables are statistically significant and the model is a good fit.
Now, we will display the graphs to test the assumptions.
par(mfrow = c(2,2))
plot(Model1)
We can see 4 diagnostic plots: the residuals-vs-fitted plot, the QQ plot, the scale-location plot and the residuals-vs-leverage plot.
In the first graph, the x axis holds the fitted values (y-hat) and the y axis the residuals. The trend line is not flat, which suggests the linearity assumption is violated.
In the next graph (the QQ plot), the y axis holds the ordered standardized residuals and the x axis the theoretical quantiles. The residuals look roughly normally distributed, as most points fall on the line, though some drift off it in the tails.
The last two plots point to non-constant variance and a few influential points.
To test further whether any assumptions are violated, we will perform some formal tests.
durbinWatsonTest(Model1)
lag Autocorrelation D-W Statistic p-value
1 -0.2078972 2.41538 0
Alternative hypothesis: rho != 0
H0: Errors are uncorrelated
H1: Errors are correlated
From the Durbin-Watson test: since the p-value is < 0.05, we have enough evidence to reject H0. This implies the uncorrelated-errors assumption is violated.
shapiro.test(Model1$residuals)
Shapiro-Wilk normality test
data: Model1$residuals
W = 0.97392, p-value = 1.766e-15
H0: Errors are normally distributed
H1: Errors are not normally distributed
From the Shapiro-Wilk normality test: since the p-value is < 0.05, we have enough evidence to reject H0. This implies the normality assumption for the errors is violated.
ncvTest(Model1)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 8.627698, Df = 1, p = 0.0033109
H0: Errors have a constant variance
H1: Errors have a non-constant variance
Here the p-value is less than 0.05, therefore the constant-variance assumption is violated.
After performing the tests, we found that several assumptions are violated. So we will try to transform the data using a log transformation.
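Before committing to the log, one can let the data suggest a power transformation of the response. The sketch below (an optional diagnostic, not in the original analysis) draws the Box-Cox log-likelihood profile for Model1; a peak near lambda = 0 would support the log choice:
# Box-Cox profile of the response; needs a strictly positive response
# (Score ranges from 7.8 to 9.3 here, so this is fine)
MASS::boxcox(Model1, lambda = seq(-2, 2, 0.1))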
We will make a new model, name it log1, and fit it to the log-transformed response.
# log transformation
log1 <- lm(log(imdb$Score) ~ imdb$`Movie Date` + imdb$`Number of Votes` + imdb$`Movie Revenue (M$)` +
             imdb$Metascore + imdb$`Time Duration (min)`, data = imdb)
summary(log1)
Call:
lm(formula = log(imdb$Score) ~ imdb$`Movie Date` + imdb$`Number of Votes` +
imdb$`Movie Revenue (M$)` + imdb$Metascore + imdb$`Time Duration (min)`,
data = imdb)
Residuals:
Min 1Q Median 3Q Max
-0.0315958 -0.0106862 -0.0003744 0.0107547 0.0240024
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.651e+00 7.567e-02 61.47 <2e-16 ***
imdb$`Movie Date` -1.340e-03 3.749e-05 -35.74 <2e-16 ***
imdb$`Number of Votes` 7.667e-08 1.089e-09 70.41 <2e-16 ***
imdb$`Movie Revenue (M$)` -9.587e-11 2.167e-12 -44.24 <2e-16 ***
imdb$Metascore 8.350e-04 3.529e-05 23.66 <2e-16 ***
imdb$`Time Duration (min)` 1.663e-04 1.349e-05 12.33 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.01315 on 1434 degrees of freedom
Multiple R-squared: 0.8875, Adjusted R-squared: 0.8871
F-statistic: 2263 on 5 and 1434 DF, p-value: < 2.2e-16
Here we have fitted the transformed multiple linear regression model with five regressors and named it log1. The multiple R² is 0.8875 and the adjusted R² is 0.8871. The overall p-value is smaller than 0.05, so the model is significant and we can reject H0.
To check the overall fit, we will use the anova() function.
anova(log1)
Analysis of Variance Table
Response: log(imdb$Score)
Df Sum Sq Mean Sq F value Pr(>F)
imdb$`Movie Date` 1 0.46903 0.46903 2713.65 < 2.2e-16 ***
imdb$`Number of Votes` 1 1.10096 1.10096 6369.82 < 2.2e-16 ***
imdb$`Movie Revenue (M$)` 1 0.24749 0.24749 1431.89 < 2.2e-16 ***
imdb$Metascore 1 0.11226 0.11226 649.49 < 2.2e-16 ***
imdb$`Time Duration (min)` 1 0.02628 0.02628 152.03 < 2.2e-16 ***
Residuals 1434 0.24785 0.00017
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From the ANOVA table, the p-values for all variables are < 2.2e-16, well below 0.05. Therefore all the variables are statistically significant and the model is a good fit.
Now, we will display the graphs to test the assumptions.
par(mfrow = c(2,2))
plot(log1)
After the transformation, we can say that:
In the first graph (residuals vs fitted), the trend line is still not flat, so the linearity assumption is still violated.
In the QQ plot, the residuals still look roughly normally distributed, with most points on the line and some drifting off in the tails; it is almost the same as before the transformation.
The last two plots still point to non-constant variance and a few influential points. To test further whether any assumptions are violated, we will perform the same formal tests.
To test for autocorrelated errors, we perform the Durbin-Watson test.
durbinWatsonTest(log1)
lag Autocorrelation D-W Statistic p-value
1 -0.2085987 2.416514 0
Alternative hypothesis: rho != 0
H0: Errors are uncorrelated
H1: Errors are correlated
From the Durbin-Watson test: since the p-value is < 0.05, we have enough evidence to reject H0. This implies the uncorrelated-errors assumption is still violated.
shapiro.test(log1$residuals)
Shapiro-Wilk normality test
data: log1$residuals
W = 0.97352, p-value = 1.295e-15
H0: Errors are normally distributed
H1: Errors are not normally distributed
From the Shapiro-Wilk normality test: since the p-value is < 0.05, we have enough evidence to reject H0. This implies the normality assumption for the errors is violated.
ncvTest(log1)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 8.671053, Df = 1, p = 0.003233
H0: Errors have a constant variance
H1: Errors have a non-constant variance
Here the p-value is again less than 0.05, so the constant-variance assumption is still violated.
Now we will perform another analysis on our dataset: stepwise regression. Below we fit a stepwise model.
step(Model1, data=imdb, direction="both")
Start: AIC=-6396.59
Score ~ imdb$`Movie Date` + imdb$`Number of Votes` + imdb$`Movie Revenue (M$)` +
imdb$Metascore + imdb$`Time Duration (min)`
Df Sum of Sq RSS AIC
<none> 16.810 -6396.6
- imdb$`Time Duration (min)` 1 2.029 18.839 -6234.5
- imdb$Metascore 1 6.971 23.781 -5899.0
- imdb$`Movie Date` 1 16.391 33.201 -5418.5
- imdb$`Movie Revenue (M$)` 1 23.174 39.984 -5150.8
- imdb$`Number of Votes` 1 62.960 79.770 -4156.3
Call:
lm(formula = Score ~ imdb$`Movie Date` + imdb$`Number of Votes` +
imdb$`Movie Revenue (M$)` + imdb$Metascore + imdb$`Time Duration (min)`,
data = imdb)
Coefficients:
(Intercept) imdb$`Movie Date` imdb$`Number of Votes`
3.014e+01 -1.154e-02 6.572e-07
imdb$`Movie Revenue (M$)` imdb$Metascore imdb$`Time Duration (min)`
-7.935e-10 7.087e-03 1.461e-03
After performing the stepwise regression, we found that the AIC is -6396.59 and that our full first model is the best-fitting model (lowest AIC): removing any variable would increase the AIC. The variables in that model, with their coefficients, are:
-Intercept: 3.014e+01
-Movie Date: -1.154e-02
-Number of Votes: 6.572e-07
-Movie Revenue (M$): -7.935e-10
-Metascore: 7.087e-03
-Time Duration (min): 1.461e-03
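In practice it is convenient to keep the selected model as an object rather than only printing it; step() returns the final fitted lm. A sketch (best_model is a name introduced here, with the step-by-step output suppressed):
# capture the model chosen by stepwise AIC
best_model <- step(Model1, direction = "both", trace = 0)
summary(best_model)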
After identifying the best model, we refit it and perform our analysis on it.
Model2 <- lm(Score ~ Metascore + imdb$`Movie Revenue (M$)` +
               imdb$`Number of Votes` + imdb$`Movie Date` + imdb$`Time Duration (min)`, data = imdb)
summary(Model2)
Call:
lm(formula = Score ~ Metascore + imdb$`Movie Revenue (M$)` +
imdb$`Number of Votes` + imdb$`Movie Date` + imdb$`Time Duration (min)`,
data = imdb)
Residuals:
Min 1Q Median 3Q Max
-0.255330 -0.080766 -0.008307 0.086434 0.198411
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.014e+01 6.231e-01 48.37 <2e-16 ***
Metascore 7.087e-03 2.906e-04 24.39 <2e-16 ***
imdb$`Movie Revenue (M$)` -7.935e-10 1.785e-11 -44.46 <2e-16 ***
imdb$`Number of Votes` 6.572e-07 8.967e-09 73.29 <2e-16 ***
imdb$`Movie Date` -1.154e-02 3.087e-04 -37.39 <2e-16 ***
imdb$`Time Duration (min)` 1.461e-03 1.111e-04 13.16 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1083 on 1434 degrees of freedom
Multiple R-squared: 0.8943, Adjusted R-squared: 0.8939
F-statistic: 2425 on 5 and 1434 DF, p-value: < 2.2e-16
Here we have refitted the model selected by the stepwise AIC procedure; it contains the same five regressors as Model1. The multiple R² is 0.8943 and the adjusted R² is 0.8939. The overall p-value is smaller than 0.05, so the model is significant and we can reject H0.
To check the overall fit, we will use the anova() function.
anova(Model2)
Analysis of Variance Table
Response: Score
Df Sum Sq Mean Sq F value Pr(>F)
Metascore 1 30.639 30.639 2613.7 < 2.2e-16 ***
imdb$`Movie Revenue (M$)` 1 30.624 30.624 2612.4 < 2.2e-16 ***
imdb$`Number of Votes` 1 62.372 62.372 5320.7 < 2.2e-16 ***
imdb$`Movie Date` 1 16.500 16.500 1407.6 < 2.2e-16 ***
imdb$`Time Duration (min)` 1 2.029 2.029 173.1 < 2.2e-16 ***
Residuals 1434 16.810 0.012
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From the ANOVA table, the p-values for all variables are < 2.2e-16, well below 0.05. Therefore all the variables are statistically significant and the model is a good fit.
Now, we will check whether any assumptions are violated.
par(mfrow=c(2,2))
plot(Model2)
We again see the 4 diagnostic plots: residuals vs fitted, the QQ plot, the scale-location plot and the leverage plot. In the first graph, the trend line is not flat, which suggests the linearity assumption is violated.
In the QQ plot, the residuals look roughly normally distributed, with most points on the line and some drifting off in the tails.
The last two plots point to non-constant variance and a few influential points.
To test the assumptions formally, we will perform some statistical tests.
ncvTest(Model2)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 8.627698, Df = 1, p = 0.0033109
H0: Errors have a constant variance
H1: Errors have a non-constant variance
Here the p-value is less than 0.05, therefore the constant-variance assumption is violated.
durbinWatsonTest(Model2)
lag Autocorrelation D-W Statistic p-value
1 -0.2078972 2.41538 0
Alternative hypothesis: rho != 0
H0: Errors are uncorrelated
H1: Errors are correlated
From the Durbin-Watson test: since the p-value is < 0.05, we have enough evidence to reject H0. This implies the uncorrelated-errors assumption is violated.
shapiro.test(Model2$residuals)
Shapiro-Wilk normality test
data: Model2$residuals
W = 0.97392, p-value = 1.766e-15
H0: Errors are normally distributed
H1: Errors are not normally distributed
From the Shapiro-Wilk normality test: since the p-value is < 0.05, we have enough evidence to reject H0. This implies the normality assumption for the errors is violated.
After performing the tests, we found that several assumptions are violated here as well. So we will try to transform the data using a square-root transformation.
We will make a new model, name it sqrt (note that this name masks base R's sqrt() function for the rest of the session), and fit it to the square root of the response.
#sqrt Transformation
sqrt <- lm(sqrt(Score) ~ (Metascore + imdb$`Movie Revenue (M$)` +
imdb$`Number of Votes` + imdb$`Movie Date`+imdb$`Time Duration (min)`), data = imdb)
summary(sqrt)
Call:
lm(formula = sqrt(Score) ~ (Metascore + imdb$`Movie Revenue (M$)` +
imdb$`Number of Votes` + imdb$`Movie Date` + imdb$`Time Duration (min)`),
data = imdb)
Residuals:
Min 1Q Median 3Q Max
-0.044910 -0.014569 -0.001016 0.015326 0.034504
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.601e+00 1.085e-01 60.82 <2e-16 ***
Metascore 1.216e-03 5.061e-05 24.03 <2e-16 ***
imdb$`Movie Revenue (M$)` -1.379e-10 3.108e-12 -44.36 <2e-16 ***
imdb$`Number of Votes` 1.122e-07 1.562e-09 71.85 <2e-16 ***
imdb$`Movie Date` -1.966e-03 5.376e-05 -36.57 <2e-16 ***
imdb$`Time Duration (min)` 2.465e-04 1.934e-05 12.74 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.01886 on 1434 degrees of freedom
Multiple R-squared: 0.891, Adjusted R-squared: 0.8906
F-statistic: 2344 on 5 and 1434 DF, p-value: < 2.2e-16
Here we have fitted the transformed model. The multiple R² is 0.891 and the adjusted R² is 0.8906.
The overall p-value is smaller than 0.05, so the model is significant and we can reject H0.
To check the overall fit, we will use the anova() function.
anova(sqrt)
Analysis of Variance Table
Response: sqrt(Score)
Df Sum Sq Mean Sq F value Pr(>F)
Metascore 1 0.89098 0.89098 2505.96 < 2.2e-16 ***
imdb$`Movie Revenue (M$)` 1 0.92234 0.92234 2594.16 < 2.2e-16 ***
imdb$`Number of Votes` 1 1.81745 1.81745 5111.72 < 2.2e-16 ***
imdb$`Movie Date` 1 0.47862 0.47862 1346.15 < 2.2e-16 ***
imdb$`Time Duration (min)` 1 0.05775 0.05775 162.42 < 2.2e-16 ***
Residuals 1434 0.50985 0.00036
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From the ANOVA table, the p-values for all variables are < 2.2e-16, well below 0.05. Therefore all the variables, and the model as a whole, are statistically significant, and the model is a good fit.
Now, we will check for assumptions by visualizing the model.
par(mfrow=c(2,2))
plot(sqrt)
After the transformation, we can say that:
In the first graph (residuals vs fitted), the trend line is still not flat, so the linearity assumption is still violated.
In the QQ plot, the residuals still look roughly normally distributed, with most points on the line and some drifting off in the tails; it is almost the same as before the transformation.
The last two plots still point to non-constant variance and a few influential points. To test further whether any assumptions are violated, we will perform the formal tests.
durbinWatsonTest(sqrt)
lag Autocorrelation D-W Statistic p-value
1 -0.2087192 2.416894 0
Alternative hypothesis: rho != 0
H0: Errors are uncorrelated
H1: Errors are correlated
From the Durbin-Watson test: since the p-value is < 0.05, we have enough evidence to reject H0. This implies the uncorrelated-errors assumption is still violated.
shapiro.test(sqrt$residuals)
Shapiro-Wilk normality test
data: sqrt$residuals
W = 0.97359, p-value = 1.373e-15
H0: Errors are normally distributed
H1: Errors are not normally distributed
From the Shapiro-Wilk normality test: since the p-value is < 0.05, we have enough evidence to reject H0. This implies the normality assumption for the errors is violated.
ncvTest(sqrt)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 8.728842, Df = 1, p = 0.0031322
H0: Errors have a constant variance
H1: Errors have a non-constant variance
Here the p-value is less than 0.05, therefore the constant-variance assumption is violated.
Before the transformation several assumptions were violated; after the square-root transformation, these assumptions are still violated.
Here we take the Restriction variable, convert it into four categories (1, 2, 3, 4), and then build a multiple linear regression that includes it as an indicator (dummy) variable using the factor() function.
imdb$Restriction[imdb$Restriction=="12"]<-"1"
imdb$Restriction[imdb$Restriction=="16"]<-"2"
imdb$Restriction[imdb$Restriction=="Tous publics"]<-"3"
imdb$Restriction[imdb$Restriction=="Tous publics avec avertissement"]<-"4"
imdb$Restriction <- factor(imdb$Restriction)
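An equivalent one-step recoding (a sketch using dplyr::recode, producing the same factor as the four assignments above; shown for reference, not run again here):
# recode the four Restriction labels and convert to a factor in one pipeline
imdb <- imdb %>%
  mutate(Restriction = factor(recode(Restriction,
                                     "12" = "1", "16" = "2",
                                     "Tous publics" = "3",
                                     "Tous publics avec avertissement" = "4")))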
Here we build the linear regression model, name it Model3, and perform our analysis on it.
Model3 <- lm(Score ~ imdb$`Movie Date` + imdb$`Number of Votes` + imdb$`Movie Revenue (M$)` +
               imdb$Metascore + imdb$`Time Duration (min)` + Restriction, data = imdb)
summary(Model3)
Call:
lm(formula = Score ~ imdb$`Movie Date` + imdb$`Number of Votes` +
imdb$`Movie Revenue (M$)` + imdb$Metascore + imdb$`Time Duration (min)` +
Restriction, data = imdb)
Residuals:
Min 1Q Median 3Q Max
-0.243150 -0.078436 0.002359 0.096447 0.203974
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.019e+01 6.159e-01 49.013 < 2e-16 ***
imdb$`Movie Date` -1.156e-02 3.052e-04 -37.879 < 2e-16 ***
imdb$`Number of Votes` 6.590e-07 8.818e-09 74.735 < 2e-16 ***
imdb$`Movie Revenue (M$)` -8.425e-10 1.992e-11 -42.298 < 2e-16 ***
imdb$Metascore 7.010e-03 2.913e-04 24.060 < 2e-16 ***
imdb$`Time Duration (min)` 1.392e-03 1.124e-04 12.377 < 2e-16 ***
Restriction2 -6.255e-02 1.168e-02 -5.355 9.95e-08 ***
Restriction3 2.057e-02 7.889e-03 2.608 0.00921 **
Restriction4 5.482e-03 1.055e-02 0.520 0.60346
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1063 on 1431 degrees of freedom
Multiple R-squared: 0.8982, Adjusted R-squared: 0.8976
F-statistic: 1579 on 8 and 1431 DF, p-value: < 2.2e-16
Here we have fitted the model with Restriction included as an indicator variable. The multiple R² is 0.8982 and the adjusted R² is 0.8976.
The overall p-value is smaller than 0.05, so the model is significant and we can reject H0.
Only the Restriction4 dummy (Tous publics avec avertissement) is insignificant, with a p-value greater than 0.05; all the other terms are significant.
To check the overall fit, we will use the anova() function.
anova(Model3)
Analysis of Variance Table
Response: Score
Df Sum Sq Mean Sq F value Pr(>F)
imdb$`Movie Date` 1 34.100 34.100 3015.704 < 2.2e-16 ***
imdb$`Number of Votes` 1 81.233 81.233 7183.978 < 2.2e-16 ***
imdb$`Movie Revenue (M$)` 1 16.680 16.680 1475.158 < 2.2e-16 ***
imdb$Metascore 1 8.122 8.122 718.318 < 2.2e-16 ***
imdb$`Time Duration (min)` 1 2.029 2.029 179.455 < 2.2e-16 ***
Restriction 3 0.629 0.210 18.547 8.33e-12 ***
Residuals 1431 16.181 0.011
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From the ANOVA table, the p-values for all variables are well below 0.05. Therefore all the variables are statistically significant and the model is a good fit.
Now, we will check for assumptions by visualizing the model.
par(mfrow=c(2,2))
plot(Model3)
In the first graph (residuals vs fitted), the trend line is not flat, which suggests the linearity assumption is violated.
In the QQ plot, the residuals look roughly normally distributed, with most points on the line and some drifting off in the tails.
The last two plots point to non-constant variance and a few influential points.
To test further whether any assumptions are violated, we will perform some formal tests.
durbinWatsonTest(Model3)
lag Autocorrelation D-W Statistic p-value
1 -0.2398101 2.478892 0
Alternative hypothesis: rho != 0
H0: Errors are uncorrelated
H1: Errors are correlated
From the Durbin-Watson test: since the p-value is < 0.05, we have enough evidence to reject H0. This implies the uncorrelated-errors assumption is violated.
shapiro.test(Model3$residuals)
Shapiro-Wilk normality test
data: Model3$residuals
W = 0.97762, p-value = 3.558e-14
H0: Errors are normally distributed
H1: Errors are not normally distributed
From the Shapiro-Wilk normality test: since the p-value is < 0.05, we have enough evidence to reject H0. This implies the normality assumption for the errors is violated.
ncvTest(Model3)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 3.45498, Df = 1, p = 0.063061
H0: Errors have a constant variance
H1: Errors have a non-constant variance
Here the p-value is greater than 0.05, so we fail to reject H0: with Restriction in the model, the constant-variance assumption is no longer violated.
After performing the tests, we found that the normality and uncorrelated-errors assumptions are still violated. So we will again try a log transformation.
We will make a new model, name it log2, and fit it to the log-transformed response.
log2 <- lm(log(Score) ~ imdb$`Movie Date` + imdb$`Number of Votes` + imdb$`Movie Revenue (M$)` +
             imdb$Metascore + imdb$`Time Duration (min)` + Restriction, data = imdb)
summary(log2)
Call:
lm(formula = log(Score) ~ imdb$`Movie Date` + imdb$`Number of Votes` +
imdb$`Movie Revenue (M$)` + imdb$Metascore + imdb$`Time Duration (min)` +
Restriction, data = imdb)
Residuals:
Min 1Q Median 3Q Max
-0.0301303 -0.0091947 0.0004506 0.0113366 0.0247214
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.660e+00 7.484e-02 62.265 < 2e-16 ***
imdb$`Movie Date` -1.343e-03 3.708e-05 -36.228 < 2e-16 ***
imdb$`Number of Votes` 7.689e-08 1.071e-09 71.771 < 2e-16 ***
imdb$`Movie Revenue (M$)` -1.016e-10 2.420e-12 -41.964 < 2e-16 ***
imdb$Metascore 8.265e-04 3.540e-05 23.349 < 2e-16 ***
imdb$`Time Duration (min)` 1.575e-04 1.366e-05 11.527 < 2e-16 ***
Restriction2 -7.515e-03 1.419e-03 -5.295 1.38e-07 ***
Restriction3 2.402e-03 9.586e-04 2.506 0.0123 *
Restriction4 1.061e-03 1.282e-03 0.827 0.4082
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.01292 on 1431 degrees of freedom
Multiple R-squared: 0.8916, Adjusted R-squared: 0.891
F-statistic: 1471 on 8 and 1431 DF, p-value: < 2.2e-16
Here we have fitted the transformed multiple linear regression model. The multiple R² is 0.8916 and the adjusted R² is 0.891. The overall p-value is smaller than 0.05, so the model is significant and we can reject H0.
To check the overall fit, we will use the anova() function.
anova(log2)
Analysis of Variance Table
Response: log(Score)
Df Sum Sq Mean Sq F value Pr(>F)
imdb$`Movie Date` 1 0.46903 0.46903 2809.836 < 2.2e-16 ***
imdb$`Number of Votes` 1 1.10096 1.10096 6595.595 < 2.2e-16 ***
imdb$`Movie Revenue (M$)` 1 0.24749 0.24749 1482.646 < 2.2e-16 ***
imdb$Metascore 1 0.11226 0.11226 672.509 < 2.2e-16 ***
imdb$`Time Duration (min)` 1 0.02628 0.02628 157.422 < 2.2e-16 ***
Restriction 3 0.00899 0.00300 17.943 1.965e-11 ***
Residuals 1431 0.23887 0.00017
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From the ANOVA table, the p-values for all variables are well below 0.05. Therefore all the variables are statistically significant and the model is a good fit.
Now, we will display the graphs to test the assumptions.
par(mfrow = c(2,2))
plot(log2)
We again see the 4 diagnostic plots: residuals vs fitted, the QQ plot, the scale-location plot and the leverage plot.
In the first graph, the trend line is not flat, which suggests the linearity assumption is violated.
In the QQ plot, the residuals look roughly normally distributed, with most points on the line and some drifting off in the tails.
The last two plots hint at possible non-constant variance and a few influential points; the formal tests below check whether the assumptions actually hold.
durbinWatsonTest(log2)
lag Autocorrelation D-W Statistic p-value
1 -0.2412033 2.481343 0
Alternative hypothesis: rho != 0
H0: Errors are uncorrelated
H1: Errors are correlated
From the Durbin-Watson test: since the p-value is < 0.05, we have enough evidence to reject H0. This implies the uncorrelated-errors assumption is violated.
shapiro.test(log2$residuals)
Shapiro-Wilk normality test
data: log2$residuals
W = 0.9772, p-value = 2.482e-14
H0: Errors are normally distributed
H1: Errors are not normally distributed
From the Shapiro-Wilk normality test: since the p-value is < 0.05, we have enough evidence to reject H0. This implies the normality assumption for the errors is violated.
ncvTest(log2)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 2.913237, Df = 1, p = 0.087855
H0: Errors have a constant variance
H1: Errors have a non-constant variance
Here the p-value is greater than 0.05, so we fail to reject H0: the constant-variance assumption is not violated for this model.
So before the transformation several assumptions were violated; after the log transformation, the normality and uncorrelated-errors assumptions are still violated, while the constant-variance assumption holds.
After building the three different models, we have to choose the best one. We will pick the most appropriate model on the basis of the multiple R² values.
The multiple R² values of the different models are:
1.) Model1: 0.8943; transformed Model1 (log1): 0.8875
2.) Model2: 0.8943; transformed Model2 (sqrt): 0.891
3.) Model3: 0.8982; transformed Model3 (log2): 0.8916
From these values, Model3 is the best model by multiple R², and it remains the best if we compare the transformed models instead. (Note that Model1 and Model2 contain the same regressors, so their fits are identical.)
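The fit statistics can also be pulled programmatically instead of being read off the printed summaries. A sketch (R² values are only strictly comparable between models with the same response, so the transformed models are left out of the AIC comparison):
# multiple R-squared of the three untransformed models
c(Model1 = summary(Model1)$r.squared,
  Model2 = summary(Model2)$r.squared,
  Model3 = summary(Model3)$r.squared)
# AIC comparison; all three model the same response, Score
AIC(Model1, Model2, Model3)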
Now we will perform prediction using individual regressors.
First, we fit a simple linear regression of Score on the Metascore variable.
prediction1 <- lm(Score~Metascore,imdb)
summary(prediction1)
Call:
lm(formula = Score ~ Metascore, data = imdb)
Residuals:
Min 1Q Median 3Q Max
-0.77185 -0.13538 0.02097 0.18229 0.76793
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.4712626 0.0562996 132.71 <2e-16 ***
Metascore 0.0132601 0.0007157 18.53 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2987 on 1438 degrees of freedom
Multiple R-squared: 0.1927, Adjusted R-squared: 0.1922
F-statistic: 343.3 on 1 and 1438 DF, p-value: < 2.2e-16
predict(prediction1,data.frame(Metascore = 100),interval="prediction", level = 0.95)
fit lwr upr
1 8.797269 8.210231 9.384308
The point prediction for a movie with Metascore = 100 is 8.797, with a 95% prediction interval of [8.210, 9.384].
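For comparison, the sketch below (not in the original analysis) computes the confidence interval for the mean Score at the same Metascore value; it is much narrower than the prediction interval because it does not include the movie-to-movie scatter:
# 95% confidence interval for the mean Score at Metascore = 100
predict(prediction1, data.frame(Metascore = 100), interval = "confidence", level = 0.95)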
Sthda.com, 2021. Correlation Test Between Two Variables in R - Easy Guides - Wiki - STHDA. [online] Available at: http://www.sthda.com/english/wiki/correlation-test-between-two-variables-in-r [Accessed 3 June 2021].
Sthda.com, 2021. Best Subsets Regression Essentials in R - Articles - STHDA. [online] Available at: http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/155-best-subsets-regression-essentials-in-r/ [Accessed 3 June 2021].
ListenData, 2021. 15 Types of Regression in Data Science. [online] Available at: https://www.listendata.com/2018/03/regression-analysis.html [Accessed 3 June 2021].