Movie Data Cleaning

Author

RB

Set Up

remove(list = ls())
library(visdat)
library(stargazer)

Please cite as: 
 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer 
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(MASS)

Attaching package: 'MASS'
The following object is masked from 'package:dplyr':

    select
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.5
✔ ggplot2   3.5.2     ✔ stringr   1.5.1
✔ lubridate 1.9.4     ✔ tibble    3.3.0
✔ purrr     1.1.0     ✔ tidyr     1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
✖ MASS::select()  masks dplyr::select()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Importing Data Set

movies_metadata <- read.csv("~/Boston College Experience - Data Analysis/movies_dataset/movies_metadata.csv")
vis_dat(movies_metadata[1:10000,])

df <- movies_metadata

Summary stats

?stargazer
stargazer(movies_metadata,                                    
          type   = "text",                            
          title  = "Summary Statistics",        
          digits = 1
          )

Summary Statistics
===============================================================
Statistic      N        Mean       St. Dev.   Min      Max     
---------------------------------------------------------------
revenue      45,460 11,209,348.0 64,332,247.0  0  2,787,965,087
runtime      45,203     94.1         38.4      0      1,256    
vote_average 45,460     5.6          1.9      0.0     10.0     
vote_count   45,460    109.9        491.3      0     14,075    
---------------------------------------------------------------

Cleaning data set

summary(df)            # See where NA values might be
    adult           belongs_to_collection    budget             genres         
 Length:45466       Length:45466          Length:45466       Length:45466      
 Class :character   Class :character      Class :character   Class :character  
 Mode  :character   Mode  :character      Mode  :character   Mode  :character  
                                                                               
                                                                               
                                                                               
                                                                               
   homepage              id              imdb_id          original_language 
 Length:45466       Length:45466       Length:45466       Length:45466      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 original_title       overview          popularity        poster_path       
 Length:45466       Length:45466       Length:45466       Length:45466      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 production_companies production_countries release_date      
 Length:45466         Length:45466         Length:45466      
 Class :character     Class :character     Class :character  
 Mode  :character     Mode  :character     Mode  :character  
                                                             
                                                             
                                                             
                                                             
    revenue             runtime        spoken_languages      status         
 Min.   :0.000e+00   Min.   :   0.00   Length:45466       Length:45466      
 1st Qu.:0.000e+00   1st Qu.:  85.00   Class :character   Class :character  
 Median :0.000e+00   Median :  95.00   Mode  :character   Mode  :character  
 Mean   :1.121e+07   Mean   :  94.13                                        
 3rd Qu.:0.000e+00   3rd Qu.: 107.00                                        
 Max.   :2.788e+09   Max.   :1256.00                                        
 NA's   :6           NA's   :263                                            
   tagline             title              video            vote_average   
 Length:45466       Length:45466       Length:45466       Min.   : 0.000  
 Class :character   Class :character   Class :character   1st Qu.: 5.000  
 Mode  :character   Mode  :character   Mode  :character   Median : 6.000  
                                                          Mean   : 5.618  
                                                          3rd Qu.: 6.800  
                                                          Max.   :10.000  
                                                          NA's   :6       
   vote_count     
 Min.   :    0.0  
 1st Qu.:    3.0  
 Median :   10.0  
 Mean   :  109.9  
 3rd Qu.:   34.0  
 Max.   :14075.0  
 NA's   :6        
df <- df %>%           # converting status into numerical values
  mutate(status = ifelse(status == "Released", 1, 0))

clean_df <- df %>%
 dplyr::select(
    -title,
    -adult,
    -status,
    -overview,
    -release_date,
    -video,
    -belongs_to_collection,
    -original_title,
    -homepage,
    -poster_path,
    -id,
    -imdb_id,
    -tagline
  )

Replacing missing values

clean_df$runtime[is.na(clean_df$runtime)] <- median(clean_df$runtime, na.rm = TRUE)                  # replace missing values in runtime with the median


clean_df$vote_average[is.na(clean_df$vote_average)] <- median(clean_df$vote_average, na.rm = TRUE)    
                      # replace missing values in vote_average with the median


clean_df$revenue[is.na(clean_df$revenue)] <- median(clean_df$revenue, na.rm = TRUE)                 # replace missing values in revenue with the median


clean_df$vote_count[is.na(clean_df$vote_count)] <- median(clean_df$vote_count, na.rm = TRUE)       # replace missing values in vote_count with the median

Cleaning data

str(clean_df)                             # what data types each variable is
'data.frame':   45466 obs. of  11 variables:
 $ budget              : chr  "30000000" "65000000" "0" "16000000" ...
 $ genres              : chr  "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]" "[{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 10751, 'name': 'Family'}]" "[{'id': 10749, 'name': 'Romance'}, {'id': 35, 'name': 'Comedy'}]" "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]" ...
 $ original_language   : chr  "en" "en" "en" "en" ...
 $ popularity          : chr  "21.946943" "17.015539" "11.7129" "3.859495" ...
 $ production_companies: chr  "[{'name': 'Pixar Animation Studios', 'id': 3}]" "[{'name': 'TriStar Pictures', 'id': 559}, {'name': 'Teitler Film', 'id': 2550}, {'name': 'Interscope Communicat"| __truncated__ "[{'name': 'Warner Bros.', 'id': 6194}, {'name': 'Lancaster Gate', 'id': 19464}]" "[{'name': 'Twentieth Century Fox Film Corporation', 'id': 306}]" ...
 $ production_countries: chr  "[{'iso_3166_1': 'US', 'name': 'United States of America'}]" "[{'iso_3166_1': 'US', 'name': 'United States of America'}]" "[{'iso_3166_1': 'US', 'name': 'United States of America'}]" "[{'iso_3166_1': 'US', 'name': 'United States of America'}]" ...
 $ revenue             : num  3.74e+08 2.63e+08 0.00 8.15e+07 7.66e+07 ...
 $ runtime             : num  81 104 101 127 106 170 127 97 106 130 ...
 $ spoken_languages    : chr  "[{'iso_639_1': 'en', 'name': 'English'}]" "[{'iso_639_1': 'en', 'name': 'English'}, {'iso_639_1': 'fr', 'name': 'Français'}]" "[{'iso_639_1': 'en', 'name': 'English'}]" "[{'iso_639_1': 'en', 'name': 'English'}]" ...
 $ vote_average        : num  7.7 6.9 6.5 6.1 5.7 7.7 6.2 5.4 5.5 6.6 ...
 $ vote_count          : num  5415 2413 92 34 173 ...
clean_df$budget <- as.integer(clean_df$budget)  # converting variables -> numeric/integer
Warning: NAs introduced by coercion
clean_df$popularity <- as.numeric(df$popularity)
Warning: NAs introduced by coercion
clean_df <- clean_df %>%                  #converting char -> factor to run kitchen sink model
  mutate(across(where(is.character), as.factor))

Kitchen sink

# Create the simplified dataframe for the model
model_df <- clean_df %>%
  dplyr::select( 
    -genres,
    -production_companies,
    -production_countries,
    -spoken_languages
  )

# Fit the model using the NEW `model_df` dataframe
reg1 <- lm(
  formula = revenue ~ budget + original_language + popularity + runtime + vote_average + vote_count,
  data = clean_df
  )

# View the summary of the full model
summary(reg1)

Call:
lm(formula = revenue ~ budget + original_language + popularity + 
    runtime + vote_average + vote_count, data = clean_df)

Residuals:
       Min         1Q     Median         3Q        Max 
-675681781    -536040    1996412    3620461 1536326966 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          2.601e+06  9.721e+06   0.268    0.789    
budget               1.483e+00  1.193e-02 124.340  < 2e-16 ***
original_languageab  6.587e+05  1.408e+07   0.047    0.963    
original_languageaf  9.776e+05  2.477e+07   0.039    0.969    
original_languageam  4.402e+06  2.478e+07   0.178    0.859    
original_languagear  1.898e+06  1.101e+07   0.172    0.863    
original_languageay  1.938e+06  3.366e+07   0.058    0.954    
original_languagebg  1.996e+06  1.408e+07   0.142    0.887    
original_languagebm  1.323e+06  2.099e+07   0.063    0.950    
original_languagebn  2.827e+06  1.142e+07   0.248    0.804    
original_languagebo  7.535e+05  2.477e+07   0.030    0.976    
original_languagebs  9.181e+05  1.299e+07   0.071    0.944    
original_languageca -9.320e+05  1.345e+07  -0.069    0.945    
original_languagecn -7.692e+05  9.889e+06  -0.078    0.938    
original_languagecs  1.562e+06  1.012e+07   0.154    0.877    
original_languagecy  3.505e+06  3.366e+07   0.104    0.917    
original_languageda -1.271e+06  9.954e+06  -0.128    0.898    
original_languagede  1.880e+05  9.768e+06   0.019    0.985    
original_languageel  2.108e+06  1.018e+07   0.207    0.836    
original_languageen -1.736e+06  9.720e+06  -0.179    0.858    
original_languageeo  8.895e+05  3.366e+07   0.026    0.979    
original_languagees -1.852e+05  9.772e+06  -0.019    0.985    
original_languageet  1.049e+06  1.174e+07   0.089    0.929    
original_languageeu  9.100e+05  2.099e+07   0.043    0.965    
original_languagefa  1.067e+06  1.023e+07   0.104    0.917    
original_languagefi  8.026e+05  9.896e+06   0.081    0.935    
original_languagefr -2.188e+06  9.741e+06  -0.225    0.822    
original_languagefy -4.395e+05  3.366e+07  -0.013    0.990    
original_languagegl  2.482e+06  3.366e+07   0.074    0.941    
original_languagehe  9.287e+05  1.049e+07   0.089    0.929    
original_languagehi  3.884e+06  9.828e+06   0.395    0.693    
original_languagehr  2.300e+06  1.141e+07   0.201    0.840    
original_languagehu  1.199e+06  1.024e+07   0.117    0.907    
original_languagehy  1.751e+06  3.366e+07   0.052    0.959    
original_languageid -5.725e+06  1.210e+07  -0.473    0.636    
original_languageis  8.128e+05  1.174e+07   0.069    0.945    
original_languageit -6.843e+05  9.753e+06  -0.070    0.944    
original_languageiu  2.896e+06  2.477e+07   0.117    0.907    
original_languageja  7.455e+05  9.760e+06   0.076    0.939    
original_languagejv  2.797e+06  3.366e+07   0.083    0.934    
original_languageka  1.881e+06  1.233e+07   0.153    0.879    
original_languagekk  1.966e+06  2.099e+07   0.094    0.925    
original_languagekn  2.532e+06  2.099e+07   0.121    0.904    
original_languageko  2.093e+05  9.841e+06   0.021    0.983    
original_languageku  3.910e+05  2.099e+07   0.019    0.985    
original_languageky  1.002e+06  2.099e+07   0.048    0.962    
original_languagela  2.113e+06  3.366e+07   0.063    0.950    
original_languagelb -2.499e+05  3.366e+07  -0.007    0.994    
original_languagelo  3.934e+05  2.477e+07   0.016    0.987    
original_languagelt  1.657e+06  1.449e+07   0.114    0.909    
original_languagelv  1.272e+06  1.233e+07   0.103    0.918    
original_languagemk  2.701e+06  1.738e+07   0.155    0.877    
original_languageml  3.556e+06  1.111e+07   0.320    0.749    
original_languagemn  3.945e+05  2.477e+07   0.016    0.987    
original_languagemr  2.267e+06  1.166e+07   0.194    0.846    
original_languagems  1.538e+05  1.738e+07   0.009    0.993    
original_languagemt -2.208e+05  3.366e+07  -0.007    0.995    
original_languagenb -6.186e+06  1.636e+07  -0.378    0.705    
original_languagene  2.036e+06  2.477e+07   0.082    0.935    
original_languagenl  6.018e+05  9.931e+06   0.061    0.952    
original_languageno -9.794e+05  1.021e+07  -0.096    0.924    
original_languagepa  2.794e+06  2.477e+07   0.113    0.910    
original_languagepl  1.400e+06  9.960e+06   0.141    0.888    
original_languageps  1.046e+06  2.477e+07   0.042    0.966    
original_languagept  3.227e+05  9.887e+06   0.033    0.974    
original_languagequ  1.440e+06  3.366e+07   0.043    0.966    
original_languagero  1.414e+06  1.062e+07   0.133    0.894    
original_languageru  6.389e+05  9.784e+06   0.065    0.948    
original_languagerw  2.806e+06  3.366e+07   0.083    0.934    
original_languagesh  6.505e+05  1.738e+07   0.037    0.970    
original_languagesi  1.438e+06  3.366e+07   0.043    0.966    
original_languagesk  2.381e+06  1.449e+07   0.164    0.869    
original_languagesl  1.485e+06  1.247e+07   0.119    0.905    
original_languagesm  3.249e+06  3.366e+07   0.097    0.923    
original_languagesq  1.785e+06  1.738e+07   0.103    0.918    
original_languagesr  2.943e+04  1.053e+07   0.003    0.998    
original_languagesv  1.530e+05  9.857e+06   0.016    0.988    
original_languageta  6.920e+06  1.039e+07   0.666    0.505    
original_languagete  8.841e+06  1.085e+07   0.815    0.415    
original_languagetg  2.702e+06  3.366e+07   0.080    0.936    
original_languageth  1.540e+04  1.040e+07   0.001    0.999    
original_languagetl  2.846e+06  1.182e+07   0.241    0.810    
original_languagetr  2.264e+06  1.007e+07   0.225    0.822    
original_languageuk  1.305e+06  1.262e+07   0.103    0.918    
original_languageur  3.283e+06  1.498e+07   0.219    0.826    
original_languageuz -5.596e+05  3.366e+07  -0.017    0.987    
original_languagevi  1.577e+06  1.408e+07   0.112    0.911    
original_languagewo  3.152e+06  1.738e+07   0.181    0.856    
original_languagexx -2.518e+05  1.122e+07  -0.022    0.982    
original_languagezh  3.175e+06  9.851e+06   0.322    0.747    
original_languagezu  2.943e+06  3.366e+07   0.087    0.930    
popularity           4.121e+05  3.095e+04  13.315  < 2e-16 ***
runtime             -2.402e+04  4.103e+03  -5.855 4.80e-09 ***
vote_average        -5.230e+05  8.136e+04  -6.428 1.31e-10 ***
vote_count           6.857e+04  4.542e+02 150.944  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 32220000 on 45365 degrees of freedom
  (6 observations deleted due to missingness)
Multiple R-squared:  0.7496,    Adjusted R-squared:  0.7491 
F-statistic:  1445 on 94 and 45365 DF,  p-value: < 2.2e-16

Backward selection

backward_model <- stepAIC(reg1, direction = "backward")
Start:  AIC=1571939
revenue ~ budget + original_language + popularity + runtime + 
    vote_average + vote_count

                    Df  Sum of Sq        RSS     AIC
- original_language 89 6.8590e+16 4.7174e+19 1571827
<none>                            4.7105e+19 1571939
- runtime            1 3.5597e+16 4.7141e+19 1571971
- vote_average       1 4.2905e+16 4.7148e+19 1571978
- popularity         1 1.8410e+17 4.7289e+19 1572114
- budget             1 1.6054e+19 6.3159e+19 1585269
- vote_count         1 2.3658e+19 7.0764e+19 1590437

Step:  AIC=1571827
revenue ~ budget + popularity + runtime + vote_average + vote_count

               Df  Sum of Sq        RSS     AIC
<none>                       4.7174e+19 1571827
- runtime       1 2.6159e+16 4.7200e+19 1571850
- vote_average  1 3.5473e+16 4.7209e+19 1571859
- popularity    1 1.6808e+17 4.7342e+19 1571987
- budget        1 1.6063e+19 6.3237e+19 1585147
- vote_count    1 2.3657e+19 7.0830e+19 1590302
# 4. Summarize the final model after backward selection
summary(backward_model)

Call:
lm(formula = revenue ~ budget + popularity + runtime + vote_average + 
    vote_count, data = clean_df)

Residuals:
       Min         1Q     Median         3Q        Max 
-674842534      25237    2184621    3367999 1540275235 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   8.261e+05  5.601e+05   1.475     0.14    
budget        1.480e+00  1.189e-02 124.409  < 2e-16 ***
popularity    3.919e+05  3.080e+04  12.726  < 2e-16 ***
runtime      -2.025e+04  4.033e+03  -5.020 5.17e-07 ***
vote_average -4.699e+05  8.037e+04  -5.846 5.06e-09 ***
vote_count    6.854e+04  4.540e+02 150.977  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 32220000 on 45454 degrees of freedom
  (6 observations deleted due to missingness)
Multiple R-squared:  0.7493,    Adjusted R-squared:  0.7492 
F-statistic: 2.716e+04 on 5 and 45454 DF,  p-value: < 2.2e-16