Library

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(corrplot)
## corrplot 0.95 loaded
library(caret)
## Loading required package: lattice
library(caTools)
## Warning: package 'caTools' was built under R version 4.4.2

I. Problem Statement

Industri perfilman menghadapi tantangan besar dalam memprediksi keberhasilan film di pasar yang kompetitif. Memahami faktor-faktor kunci yang mempengaruhi koleksi pendapatan film sangat penting bagi produser, studio, dan investor untuk membuat keputusan strategis. Oleh karena itu, diperlukan pemodelan prediktif untuk mengidentifikasi variabel-variabel yang berkontribusi terhadap kesuksesan film, seperti anggaran, pengeluaran pemasaran, durasi film, dan penilaian dari aktor utama, sutradara, dan kritikus. Penelitian ini bertujuan untuk membangun model klasifikasi yang dapat memprediksi keberhasilan koleksi film berdasarkan data historis, sehingga dapat memberikan wawasan yang lebih dalam tentang faktor-faktor yang menentukan keberhasilan film di box office.

##Daftar Langkah 1. Problem Statement 2. Read Data + Data Understanding 3. Data Wrangling 4. EDA 5. Cross Validation 6. Build Model 7. Model Selection & Assumption 8. Predict 9. Model Evaluation

II Read Data + Data Understanding

data <- read.csv("kaggle/Movie_classification.csv")
str(data)
## 'data.frame':    506 obs. of  19 variables:
##  $ Marketing.expense  : num  20.1 20.5 20.5 20.6 21.4 ...
##  $ Production.expense : num  59.6 69.1 69.1 59.4 59.4 ...
##  $ Multiplex.coverage : num  0.462 0.531 0.531 0.542 0.542 0.542 0.476 0.476 0.476 0.476 ...
##  $ Budget             : num  36524 35669 39913 38874 39702 ...
##  $ Movie_length       : num  139 152 135 119 128 ...
##  $ Lead_.Actor_Rating : num  7.83 7.5 7.49 6.89 6.92 ...
##  $ Lead_Actress_rating: num  8.1 7.65 7.57 7.04 7.07 ...
##  $ Director_rating    : num  7.91 7.44 7.5 6.92 6.82 ...
##  $ Producer_rating    : num  8 7.47 7.51 7.02 7.07 ...
##  $ Critic_rating      : num  7.94 7.44 7.44 8.26 8.26 7.26 8.96 7.96 7.96 7.96 ...
##  $ Trailer_views      : int  527367 494055 547051 516279 531448 498425 459241 400821 295168 412012 ...
##  $ X3D_available      : chr  "YES" "NO" "NO" "YES" ...
##  $ Time_taken         : num  110 147 148 185 176 ...
##  $ Twitter_hastags    : num  224 243 2022 225 226 ...
##  $ Genre              : chr  "Thriller" "Drama" "Comedy" "Drama" ...
##  $ Avg_age_actors     : int  23 42 38 45 55 53 41 56 55 45 ...
##  $ Num_multiplex      : int  494 462 458 472 395 460 522 571 564 508 ...
##  $ Collection         : int  48000 43200 69400 66800 72400 57400 45800 44200 33000 37800 ...
##  $ Start_Tech_Oscar   : int  1 0 1 1 1 0 0 0 1 1 ...
summary(data)
##  Marketing.expense Production.expense Multiplex.coverage     Budget     
##  Min.   :  20.13   Min.   : 55.92     Min.   :0.1290     Min.   :19781  
##  1st Qu.:  21.64   1st Qu.: 65.38     1st Qu.:0.3760     1st Qu.:32694  
##  Median :  25.13   Median : 74.38     Median :0.4620     Median :34488  
##  Mean   :  92.27   Mean   : 77.27     Mean   :0.4453     Mean   :34911  
##  3rd Qu.:  93.54   3rd Qu.: 91.20     3rd Qu.:0.5510     3rd Qu.:36794  
##  Max.   :1799.52   Max.   :110.48     Max.   :0.6150     Max.   :48773  
##                                                                         
##   Movie_length   Lead_.Actor_Rating Lead_Actress_rating Director_rating
##  Min.   : 76.4   Min.   :3.840      Min.   :4.035       Min.   :3.840  
##  1st Qu.:118.5   1st Qu.:7.316      1st Qu.:7.504       1st Qu.:7.296  
##  Median :151.0   Median :8.307      Median :8.495       Median :8.312  
##  Mean   :142.1   Mean   :8.014      Mean   :8.186       Mean   :8.020  
##  3rd Qu.:167.6   3rd Qu.:8.865      3rd Qu.:9.030       3rd Qu.:8.884  
##  Max.   :173.5   Max.   :9.435      Max.   :9.540       Max.   :9.425  
##                                                                        
##  Producer_rating Critic_rating   Trailer_views    X3D_available     
##  Min.   :4.030   Min.   :6.600   Min.   :212912   Length:506        
##  1st Qu.:7.508   1st Qu.:7.200   1st Qu.:409128   Class :character  
##  Median :8.465   Median :7.960   Median :462460   Mode  :character  
##  Mean   :8.191   Mean   :7.811   Mean   :449861                     
##  3rd Qu.:9.030   3rd Qu.:8.260   3rd Qu.:500248                     
##  Max.   :9.635   Max.   :9.400   Max.   :567784                     
##                                                                     
##    Time_taken    Twitter_hastags     Genre           Avg_age_actors 
##  Min.   :  0.0   Min.   : 201.2   Length:506         Min.   : 3.00  
##  1st Qu.:132.3   1st Qu.: 223.8   Class :character   1st Qu.:28.00  
##  Median :160.0   Median : 254.4   Mode  :character   Median :39.00  
##  Mean   :157.4   Mean   : 260.8                      Mean   :39.18  
##  3rd Qu.:181.9   3rd Qu.: 283.4                      3rd Qu.:50.00  
##  Max.   :217.5   Max.   :2022.4                      Max.   :60.00  
##  NA's   :12                                                         
##  Num_multiplex     Collection     Start_Tech_Oscar
##  Min.   :333.0   Min.   : 10000   Min.   :0.0000  
##  1st Qu.:465.0   1st Qu.: 34050   1st Qu.:0.0000  
##  Median :535.5   Median : 42400   Median :1.0000  
##  Mean   :545.0   Mean   : 45058   Mean   :0.5455  
##  3rd Qu.:614.8   3rd Qu.: 50000   3rd Qu.:1.0000  
##  Max.   :868.0   Max.   :100000   Max.   :1.0000  
## 

III Data Wrangling

# Mengganti nilai missing di 'Time_taken' dengan median
data$Time_taken[is.na(data$Time_taken)] <- median(data$Time_taken, na.rm = TRUE)

# Encoding variabel kategorikal 'X3D_available' dan 'Genre'
data$X3D_available <- as.factor(ifelse(data$X3D_available == "YES", 1, 0))
data$Genre <- as.factor(data$Genre)

IV Exploratory Data Analysis (EDA)

#Buat Plot
ggplot(data, aes(x = Collection)) + geom_histogram(binwidth = 5000, fill = "blue", color = "black") + 
  theme_minimal() + labs(title = "Distribusi Koleksi")

#Buat Corrplot
numeric_data <- data[sapply(data, is.numeric)]
corr_matrix <- cor(numeric_data)
corrplot(corr_matrix, method = "color")

# Memilih hanya kolom numerik dari data frame
numeric_data <- data[, sapply(data, is.numeric)]

# Menampilkan matriks korelasi
correlation_matrix <- cor(numeric_data, use = "complete.obs")
print(correlation_matrix)
##                     Marketing.expense Production.expense Multiplex.coverage
## Marketing.expense          1.00000000       0.4065834114       -0.420971711
## Production.expense         0.40658341       1.0000000000       -0.763651447
## Multiplex.coverage        -0.42097171      -0.7636514469        1.000000000
## Budget                    -0.21924670      -0.3916758527        0.302188188
## Movie_length               0.35273425       0.6447785114       -0.731470104
## Lead_.Actor_Rating         0.38005024       0.7064813315       -0.768589016
## Lead_Actress_rating        0.37981329       0.7079564568       -0.769724000
## Director_rating            0.38006890       0.7075663557       -0.769156999
## Producer_rating            0.37646167       0.7058188780       -0.764873344
## Critic_rating             -0.18498503      -0.2515646606        0.145555231
## Trailer_views             -0.44345715      -0.5916570937        0.581386123
## Time_taken                 0.02610779       0.0157361720        0.034891155
## Twitter_hastags            0.01351787      -0.0008386304        0.004881587
## Avg_age_actors             0.05920392       0.0558098001       -0.092103994
## Num_multiplex              0.38329769       0.7075585134       -0.915494613
## Collection                -0.38958244      -0.4847543793        0.429300219
## Start_Tech_Oscar          -0.01341676      -0.0244044234       -0.004017022
##                          Budget Movie_length Lead_.Actor_Rating
## Marketing.expense   -0.21924670  0.352734251         0.38005024
## Production.expense  -0.39167585  0.644778511         0.70648133
## Multiplex.coverage   0.30218819 -0.731470104        -0.76858902
## Budget               1.00000000 -0.240264931        -0.20846434
## Movie_length        -0.24026493  1.000000000         0.74690431
## Lead_.Actor_Rating  -0.20846434  0.746904308         1.00000000
## Lead_Actress_rating -0.20398069  0.746493103         0.99790453
## Director_rating     -0.20190700  0.747020751         0.99773453
## Producer_rating     -0.20539683  0.746706615         0.99407321
## Critic_rating        0.23236071 -0.217830158        -0.16997774
## Trailer_views        0.60253604 -0.589317762        -0.49026692
## Time_taken           0.03969711 -0.019136275         0.03852357
## Twitter_hastags      0.03067351  0.009379579         0.01446283
## Avg_age_actors      -0.06469390  0.075197765         0.03679436
## Num_multiplex       -0.28279597  0.673895783         0.70633064
## Collection           0.69630379 -0.377998896        -0.25135494
## Start_Tech_Oscar    -0.02714765  0.016291411        -0.03530851
##                     Lead_Actress_rating Director_rating Producer_rating
## Marketing.expense            0.37981329      0.38006890     0.376461674
## Production.expense           0.70795646      0.70756636     0.705818878
## Multiplex.coverage          -0.76972400     -0.76915700    -0.764873344
## Budget                      -0.20398069     -0.20190700    -0.205396835
## Movie_length                 0.74649310      0.74702075     0.746706615
## Lead_.Actor_Rating           0.99790453      0.99773453     0.994073208
## Lead_Actress_rating          1.00000000      0.99809682     0.994002603
## Director_rating              0.99809682      1.00000000     0.994125524
## Producer_rating              0.99400260      0.99412552     1.000000000
## Critic_rating               -0.16599207     -0.16663786    -0.167003414
## Trailer_views               -0.48753605     -0.48645183    -0.487910743
## Time_taken                   0.03840683      0.03629349     0.029083451
## Twitter_hastags              0.01023920      0.01007678     0.005849935
## Avg_age_actors               0.03800505      0.04146959     0.032542133
## Num_multiplex                0.70825662      0.70936354     0.703517871
## Collection                  -0.24945945     -0.24665004    -0.248200181
## Start_Tech_Oscar            -0.04035558     -0.03576832    -0.043612169
##                     Critic_rating Trailer_views   Time_taken Twitter_hastags
## Marketing.expense    -0.184985034  -0.443457148  0.026107787    0.0135178715
## Production.expense   -0.251564661  -0.591657094  0.015736172   -0.0008386304
## Multiplex.coverage    0.145555231   0.581386123  0.034891155    0.0048815872
## Budget                0.232360714   0.602536042  0.039697111    0.0306735090
## Movie_length         -0.217830158  -0.589317762 -0.019136275    0.0093795793
## Lead_.Actor_Rating   -0.169977738  -0.490266915  0.038523569    0.0144628347
## Lead_Actress_rating  -0.165992070  -0.487536047  0.038406833    0.0102391975
## Director_rating      -0.166637855  -0.486451834  0.036293493    0.0100767796
## Producer_rating      -0.167003414  -0.487910743  0.029083451    0.0058499350
## Critic_rating         1.000000000   0.228641376 -0.015897790   -0.0236552818
## Trailer_views         0.228641376   1.000000000  0.073588662   -0.0067042514
## Time_taken           -0.015897790   0.073588662  1.000000000   -0.0065635530
## Twitter_hastags      -0.023655282  -0.006704251 -0.006563553    1.0000000000
## Avg_age_actors       -0.049797149  -0.049725754  0.071199513   -0.0048398133
## Num_multiplex        -0.128769284  -0.544099603 -0.056093462    0.0062551706
## Collection            0.341288410   0.720119419  0.109127690    0.0231221267
## Start_Tech_Oscar     -0.001084021  -0.075782817 -0.063930342    0.0773331176
##                     Avg_age_actors Num_multiplex  Collection Start_Tech_Oscar
## Marketing.expense      0.059203919   0.383297685 -0.38958244     -0.013416763
## Production.expense     0.055809800   0.707558513 -0.48475438     -0.024404423
## Multiplex.coverage    -0.092103994  -0.915494613  0.42930022     -0.004017022
## Budget                -0.064693897  -0.282795967  0.69630379     -0.027147654
## Movie_length           0.075197765   0.673895783 -0.37799890      0.016291411
## Lead_.Actor_Rating     0.036794363   0.706330636 -0.25135494     -0.035308511
## Lead_Actress_rating    0.038005053   0.708256617 -0.24945945     -0.040355577
## Director_rating        0.041469589   0.709363544 -0.24665004     -0.035768322
## Producer_rating        0.032542133   0.703517871 -0.24820018     -0.043612169
## Critic_rating         -0.049797149  -0.128769284  0.34128841     -0.001084021
## Trailer_views         -0.049725754  -0.544099603  0.72011942     -0.075782817
## Time_taken             0.071199513  -0.056093462  0.10912769     -0.063930342
## Twitter_hastags       -0.004839813   0.006255171  0.02312213      0.077333118
## Avg_age_actors         1.000000000   0.078811018 -0.04742623      0.040580566
## Num_multiplex          0.078811018   1.000000000 -0.39172857     -0.004857210
## Collection            -0.047426233  -0.391728569  1.00000000      0.154698364
## Start_Tech_Oscar       0.040580566  -0.004857210  0.15469836      1.000000000

Positive Correlation Lead Actor Rating, Lead Actress Rating, Director Rating dan Producer Rating memiliki korelasi yang positif (sekitar 0.99), menunjukkan rating yang tinggi di salah satu variabel akan memiliki rating yang tinggi juga di variabel yang lain. Korelasi antara production Expense dengan Movie Length (0.64) menunjukkan film yang berdurasi lebih panjang memiliki biaya produksi yang lebih tinggi. Collection dengan Budget (0.70) menunjukkan film dengan budget lebih tinggi meningkatkan kemungkinan film tersebut menghasilkan collection box office yang lebih tinggi.

Negative Correlation Multiplex Coverage dan Movie Length. Film berdurasi panjang sering menunjukkan multiplexes yang lebih sedikit.

# Buat variable high collection berdasarkan mediannya agar bisa dibuat menjadi logistic regression
data$high_collection <- ifelse(data$Collection > median(data$Collection, na.rm = TRUE), 1, 0)
str(data)
## 'data.frame':    506 obs. of  20 variables:
##  $ Marketing.expense  : num  20.1 20.5 20.5 20.6 21.4 ...
##  $ Production.expense : num  59.6 69.1 69.1 59.4 59.4 ...
##  $ Multiplex.coverage : num  0.462 0.531 0.531 0.542 0.542 0.542 0.476 0.476 0.476 0.476 ...
##  $ Budget             : num  36524 35669 39913 38874 39702 ...
##  $ Movie_length       : num  139 152 135 119 128 ...
##  $ Lead_.Actor_Rating : num  7.83 7.5 7.49 6.89 6.92 ...
##  $ Lead_Actress_rating: num  8.1 7.65 7.57 7.04 7.07 ...
##  $ Director_rating    : num  7.91 7.44 7.5 6.92 6.82 ...
##  $ Producer_rating    : num  8 7.47 7.51 7.02 7.07 ...
##  $ Critic_rating      : num  7.94 7.44 7.44 8.26 8.26 7.26 8.96 7.96 7.96 7.96 ...
##  $ Trailer_views      : int  527367 494055 547051 516279 531448 498425 459241 400821 295168 412012 ...
##  $ X3D_available      : Factor w/ 2 levels "0","1": 2 1 1 2 1 2 2 1 2 2 ...
##  $ Time_taken         : num  110 147 148 185 176 ...
##  $ Twitter_hastags    : num  224 243 2022 225 226 ...
##  $ Genre              : Factor w/ 4 levels "Action","Comedy",..: 4 3 2 3 3 2 4 3 2 4 ...
##  $ Avg_age_actors     : int  23 42 38 45 55 53 41 56 55 45 ...
##  $ Num_multiplex      : int  494 462 458 472 395 460 522 571 564 508 ...
##  $ Collection         : int  48000 43200 69400 66800 72400 57400 45800 44200 33000 37800 ...
##  $ Start_Tech_Oscar   : int  1 0 1 1 1 0 0 0 1 1 ...
##  $ high_collection    : num  1 1 1 1 1 1 1 1 0 0 ...

V. Cross Validation

# Set seed
set.seed(88)

# Split data into training (80%) and test (20%) sets
split <- sample.split(data$high_collection, SplitRatio = 0.8)
train_data <- subset(data, split == TRUE)
test_data <- subset(data, split == FALSE)

Buat Data Training dan Data Testing 80-20

VI. Build Model

# Fit logistic regression model
model_all <- glm(high_collection ~ ., 
              data = train_data, family = "binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model_all)
## 
## Call:
## glm(formula = high_collection ~ ., family = "binomial", data = train_data)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)
## (Intercept)         -1.247e+03  3.263e+05  -0.004    0.997
## Marketing.expense    5.561e-03  9.649e+01   0.000    1.000
## Production.expense   9.144e-01  4.339e+02   0.002    0.998
## Multiplex.coverage   6.490e+00  1.280e+05   0.000    1.000
## Budget              -1.859e-03  2.791e+00  -0.001    0.999
## Movie_length         4.811e-01  4.939e+02   0.001    0.999
## Lead_.Actor_Rating   1.160e+01  8.660e+04   0.000    1.000
## Lead_Actress_rating -2.381e+01  1.531e+05   0.000    1.000
## Director_rating      3.753e+01  2.165e+05   0.000    1.000
## Producer_rating     -3.738e+01  7.343e+04  -0.001    1.000
## Critic_rating        1.673e+00  1.921e+04   0.000    1.000
## Trailer_views        1.620e-04  2.487e-01   0.001    0.999
## X3D_available1       8.518e+00  1.910e+04   0.000    1.000
## Time_taken          -2.169e-01  2.693e+02  -0.001    0.999
## Twitter_hastags     -1.897e-01  2.542e+02  -0.001    0.999
## GenreComedy         -1.003e+01  2.691e+04   0.000    1.000
## GenreDrama          -1.825e+01  2.045e+04  -0.001    0.999
## GenreThriller       -1.735e+01  1.407e+04  -0.001    0.999
## Avg_age_actors      -1.297e-01  7.331e+02   0.000    1.000
## Num_multiplex       -1.963e-02  2.630e+02   0.000    1.000
## Collection           3.038e-02  2.547e+00   0.012    0.990
## Start_Tech_Oscar     1.630e+00  2.136e+04   0.000    1.000
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5.6139e+02  on 404  degrees of freedom
## Residual deviance: 3.9995e-07  on 383  degrees of freedom
## AIC: 44
## 
## Number of Fisher Scoring iterations: 25
# Fit logistic regression model
model <- glm(high_collection ~ Budget + Marketing.expense + Production.expense + 
              Multiplex.coverage + Movie_length + Lead_.Actor_Rating + 
              Lead_Actress_rating + Director_rating + Producer_rating + 
              Critic_rating, 
              data = train_data, family = "binomial")

summary(model)
## 
## Call:
## glm(formula = high_collection ~ Budget + Marketing.expense + 
##     Production.expense + Multiplex.coverage + Movie_length + 
##     Lead_.Actor_Rating + Lead_Actress_rating + Director_rating + 
##     Producer_rating + Critic_rating, family = "binomial", data = train_data)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -2.383e+01  4.630e+00  -5.146 2.66e-07 ***
## Budget               5.610e-04  7.918e-05   7.085 1.39e-12 ***
## Marketing.expense   -9.657e-03  2.796e-03  -3.454 0.000552 ***
## Production.expense  -7.115e-04  1.769e-02  -0.040 0.967910    
## Multiplex.coverage   1.640e+00  2.459e+00   0.667 0.504907    
## Movie_length        -5.119e-02  9.546e-03  -5.362 8.22e-08 ***
## Lead_.Actor_Rating   3.206e+00  2.465e+00   1.301 0.193339    
## Lead_Actress_rating -5.922e+00  2.614e+00  -2.265 0.023495 *  
## Director_rating      3.296e+00  2.552e+00   1.291 0.196607    
## Producer_rating      3.719e-01  1.349e+00   0.276 0.782791    
## Critic_rating        6.264e-01  2.489e-01   2.517 0.011839 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 561.39  on 404  degrees of freedom
## Residual deviance: 291.88  on 394  degrees of freedom
## AIC: 313.88
## 
## Number of Fisher Scoring iterations: 7

VII. Model Selection & Assumption

Observasi model_all Coefficients : Banyak coefficient yang memiliki error yang tinggi dan nilai p-value yang tidak significant. Menunjukkan fitur yang tidak berelasi kuat dengan variabel target. Predictors : tidak ada prediktor yang menunjukkan hasil yang tinggi (semua p-values > 0.05) Deviance : nilai null deviance 561.38 dan residual deviance 0.00000039995 mengindikasikan overfitting

Observasi model Model ini memiliki coefficient dan standar error yang lebih realistis. Beberapa prediktor menunjukkan hasil yang baik seperti Budget (3bintang), Marketing.expense(3bintang), Movie_length(3bintang), Lead_Actress_rating(1bintang), dan Critic_rating(1bintang) yang menunjukkan variabel memiliki relasi yang baik dengan variabel target.

VIII. Predict

# Predict test_data
predicted_probabilities_test <- predict(model, newdata = test_data, type = "response")
predicted_probabilities_test
##            5            6           14           15           17           18 
## 9.583902e-01 4.965116e-01 2.708238e-01 2.161229e-01 7.995395e-01 1.297925e-01 
##           19           31           33           48           52           55 
## 2.515332e-01 3.520443e-02 1.987650e-01 1.773051e-01 4.163235e-01 2.073448e-01 
##           62           63           64           65           71           82 
## 3.443205e-02 2.663007e-01 7.729818e-01 7.264012e-01 9.430523e-01 4.464873e-01 
##           83           86           87           95           97          102 
## 9.036370e-01 8.849264e-01 5.710478e-01 4.996169e-01 7.661037e-01 9.096300e-01 
##          105          114          119          127          128          132 
## 2.629064e-01 3.929880e-01 2.947193e-01 1.316618e-01 9.656462e-02 1.928969e-01 
##          135          136          142          144          150          157 
## 8.620261e-02 3.866147e-01 1.250611e-02 3.859681e-02 4.233765e-02 3.382130e-02 
##          172          173          180          182          186          189 
## 1.790707e-01 1.629269e-01 9.808972e-01 7.001760e-01 5.056124e-01 9.788240e-01 
##          192          194          197          199          205          207 
## 9.543148e-01 9.875928e-01 9.883949e-01 9.878297e-01 9.991040e-01 7.454044e-01 
##          209          218          223          227          229          235 
## 3.987894e-01 6.678907e-01 9.203462e-01 9.926902e-01 9.997826e-01 9.024851e-01 
##          236          237          243          250          252          254 
## 6.074098e-01 6.431499e-01 6.829694e-01 9.300475e-01 9.184304e-01 9.995266e-01 
##          255          271          280          282          287          299 
## 3.768803e-01 4.813464e-01 9.914594e-01 9.743895e-01 5.611726e-01 9.101849e-01 
##          303          307          315          317          327          335 
## 9.593827e-01 9.816081e-01 3.945587e-01 1.497358e-01 7.271344e-01 6.599393e-01 
##          336          338          352          356          367          368 
## 3.403899e-01 3.280902e-01 3.417480e-01 1.955951e-01 7.464864e-03 6.847839e-05 
##          369          374          375          377          380          383 
## 5.998306e-03 1.510973e-03 2.653916e-05 1.607898e-01 2.390535e-02 1.821117e-02 
##          390          393          399          418          422          429 
## 7.140381e-03 1.635269e-03 1.711045e-05 2.185847e-04 8.323154e-02 3.936993e-01 
##          431          436          462          464          466          468 
## 1.383401e-01 3.088634e-01 2.876975e-01 2.174922e-01 2.034058e-01 7.728164e-02 
##          474          483          486          493          501 
## 8.741133e-01 8.091946e-01 7.574132e-01 2.496067e-01 3.931975e-01
# Klasifikasi prediksi 1 (high collection) atau 0 (low collection) menggunakan threshold
predicted_class_test <- ifelse(predicted_probabilities_test > 0.5, 1, 0)
predicted_class_test
##   5   6  14  15  17  18  19  31  33  48  52  55  62  63  64  65  71  82  83  86 
##   1   0   0   0   1   0   0   0   0   0   0   0   0   0   1   1   1   0   1   1 
##  87  95  97 102 105 114 119 127 128 132 135 136 142 144 150 157 172 173 180 182 
##   1   0   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   1 
## 186 189 192 194 197 199 205 207 209 218 223 227 229 235 236 237 243 250 252 254 
##   1   1   1   1   1   1   1   1   0   1   1   1   1   1   1   1   1   1   1   1 
## 255 271 280 282 287 299 303 307 315 317 327 335 336 338 352 356 367 368 369 374 
##   0   0   1   1   1   1   1   1   0   0   1   1   0   0   0   0   0   0   0   0 
## 375 377 380 383 390 393 399 418 422 429 431 436 462 464 466 468 474 483 486 493 
##   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   1   1   0 
## 501 
##   0

IX. Model Evaluation

# Membandingkan predicted classes dengan actual classes
conf_matrix <- table(Predicted = predicted_class_test, Actual = test_data$high_collection)
conf_matrix
##          Actual
## Predicted  0  1
##         0 48 11
##         1  3 39
# Kalkulasi akurasi
accuracy <- sum(predicted_class_test == test_data$high_collection) / nrow(test_data)
print(paste("Accuracy:", accuracy))
## [1] "Accuracy: 0.861386138613861"
# Confusion Matrix
confusionMatrix(factor(predicted_class_test), factor(test_data$high_collection))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 48 11
##          1  3 39
##                                           
##                Accuracy : 0.8614          
##                  95% CI : (0.7784, 0.9221)
##     No Information Rate : 0.505           
##     P-Value [Acc > NIR] : 4.937e-14       
##                                           
##                   Kappa : 0.7223          
##                                           
##  Mcnemar's Test P-Value : 0.06137         
##                                           
##             Sensitivity : 0.9412          
##             Specificity : 0.7800          
##          Pos Pred Value : 0.8136          
##          Neg Pred Value : 0.9286          
##              Prevalence : 0.5050          
##          Detection Rate : 0.4752          
##    Detection Prevalence : 0.5842          
##       Balanced Accuracy : 0.8606          
##                                           
##        'Positive' Class : 0               
## 

Kesimpulan Evaluasi Model : Accuracy : Model mencapai akurasi 86.14% . Ini merupakan indikator performa model cukup dapat diandalkan untuk memprediksi apakah suatu film akan memiliki collection yang tinggi atau rendah

95% Confidence Interval: range (0.7784, 0.9221) menunjukkan akurasi model signifikan secara statistik dan konsisten saat diuji dengan data test

Confusion Matrix: True Positives (48): Model berhasil memprediksi 48 instance di mana collection film rendah (0) dengan benar. True Negatives (39): Model berhasil memprediksi 39 instance di mana collection film tinggi (1) dengan benar. False Positives (3): Model salah memprediksi 3 instance sebagai collection tinggi padahal sebenarnya rendah. False Negatives (11): Model salah memprediksi 11 instance sebagai collection rendah padahal sebenarnya tinggi.

Sensitivitas dan Spesifisitas: Sensitivitas (Recall untuk Kelas 0): 0.9412 menunjukkan bahwa model berhasil mendeteksi 94.12% dari kelas 0 aktual (collection rendah). Artinya, model sangat efektif dalam mendeteksi kasus collection rendah. Spesifisitas (Recall untuk Kelas 1): 0.7800 menunjukkan bahwa model berhasil mendeteksi 78% dari kelas 1 aktual (collection tinggi). Meskipun lebih rendah dari sensitivitas, ini masih menunjukkan kemampuan yang baik dalam mendeteksi kasus collection tinggi. 4. Nilai Prediktif Positif dan Negatif: Nilai Prediktif Positif (Presisi untuk Kelas 0): 0.8136 menunjukkan bahwa ketika model memprediksi 0, prediksinya benar sebesar 81.36% dari waktu. Nilai Prediktif Negatif (Presisi untuk Kelas 1): 0.9286 menunjukkan bahwa ketika model memprediksi 1, prediksinya benar sebesar 92.86% dari waktu.

Akurasi Seimbang: Akurasi Seimbang: 0.8606 merepresentasikan rata-rata dari sensitivitas dan spesifisitas. Metrik ini penting ketika dataset mungkin tidak memiliki representasi kelas yang seimbang, menunjukkan bahwa model berkinerja baik untuk kedua kelas positif dan negatif.

False Negative relatif lebih tinggi (11 kasus) dibandingkan dengan false positive (3 kasus). Ini mungkin menunjukkan bahwa model cenderung melewatkan kasus collection tinggi, yang mungkin memerlukan penyesuaian jika mendeteksi kasus ini sangat penting.