Library
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(corrplot)
## corrplot 0.95 loaded
library(caret)
## Loading required package: lattice
library(caTools)
## Warning: package 'caTools' was built under R version 4.4.2
Industri perfilman menghadapi tantangan besar dalam memprediksi keberhasilan film di pasar yang kompetitif. Memahami faktor-faktor kunci yang mempengaruhi koleksi pendapatan film sangat penting bagi produser, studio, dan investor untuk membuat keputusan strategis. Oleh karena itu, diperlukan pemodelan prediktif untuk mengidentifikasi variabel-variabel yang berkontribusi terhadap kesuksesan film, seperti anggaran, pengeluaran pemasaran, durasi film, dan penilaian dari aktor utama, sutradara, dan kritikus. Penelitian ini bertujuan untuk membangun model klasifikasi yang dapat memprediksi keberhasilan koleksi film berdasarkan data historis, sehingga dapat memberikan wawasan yang lebih dalam tentang faktor-faktor yang menentukan keberhasilan film di box office.
##Daftar Langkah 1. Problem Statement 2. Read Data + Data Understanding 3. Data Wrangling 4. EDA 5. Cross Validation 6. Build Model 7. Model Selection & Assumption 8. Predict 9. Model Evaluation
data <- read.csv("kaggle/Movie_classification.csv")
str(data)
## 'data.frame': 506 obs. of 19 variables:
## $ Marketing.expense : num 20.1 20.5 20.5 20.6 21.4 ...
## $ Production.expense : num 59.6 69.1 69.1 59.4 59.4 ...
## $ Multiplex.coverage : num 0.462 0.531 0.531 0.542 0.542 0.542 0.476 0.476 0.476 0.476 ...
## $ Budget : num 36524 35669 39913 38874 39702 ...
## $ Movie_length : num 139 152 135 119 128 ...
## $ Lead_.Actor_Rating : num 7.83 7.5 7.49 6.89 6.92 ...
## $ Lead_Actress_rating: num 8.1 7.65 7.57 7.04 7.07 ...
## $ Director_rating : num 7.91 7.44 7.5 6.92 6.82 ...
## $ Producer_rating : num 8 7.47 7.51 7.02 7.07 ...
## $ Critic_rating : num 7.94 7.44 7.44 8.26 8.26 7.26 8.96 7.96 7.96 7.96 ...
## $ Trailer_views : int 527367 494055 547051 516279 531448 498425 459241 400821 295168 412012 ...
## $ X3D_available : chr "YES" "NO" "NO" "YES" ...
## $ Time_taken : num 110 147 148 185 176 ...
## $ Twitter_hastags : num 224 243 2022 225 226 ...
## $ Genre : chr "Thriller" "Drama" "Comedy" "Drama" ...
## $ Avg_age_actors : int 23 42 38 45 55 53 41 56 55 45 ...
## $ Num_multiplex : int 494 462 458 472 395 460 522 571 564 508 ...
## $ Collection : int 48000 43200 69400 66800 72400 57400 45800 44200 33000 37800 ...
## $ Start_Tech_Oscar : int 1 0 1 1 1 0 0 0 1 1 ...
summary(data)
## Marketing.expense Production.expense Multiplex.coverage Budget
## Min. : 20.13 Min. : 55.92 Min. :0.1290 Min. :19781
## 1st Qu.: 21.64 1st Qu.: 65.38 1st Qu.:0.3760 1st Qu.:32694
## Median : 25.13 Median : 74.38 Median :0.4620 Median :34488
## Mean : 92.27 Mean : 77.27 Mean :0.4453 Mean :34911
## 3rd Qu.: 93.54 3rd Qu.: 91.20 3rd Qu.:0.5510 3rd Qu.:36794
## Max. :1799.52 Max. :110.48 Max. :0.6150 Max. :48773
##
## Movie_length Lead_.Actor_Rating Lead_Actress_rating Director_rating
## Min. : 76.4 Min. :3.840 Min. :4.035 Min. :3.840
## 1st Qu.:118.5 1st Qu.:7.316 1st Qu.:7.504 1st Qu.:7.296
## Median :151.0 Median :8.307 Median :8.495 Median :8.312
## Mean :142.1 Mean :8.014 Mean :8.186 Mean :8.020
## 3rd Qu.:167.6 3rd Qu.:8.865 3rd Qu.:9.030 3rd Qu.:8.884
## Max. :173.5 Max. :9.435 Max. :9.540 Max. :9.425
##
## Producer_rating Critic_rating Trailer_views X3D_available
## Min. :4.030 Min. :6.600 Min. :212912 Length:506
## 1st Qu.:7.508 1st Qu.:7.200 1st Qu.:409128 Class :character
## Median :8.465 Median :7.960 Median :462460 Mode :character
## Mean :8.191 Mean :7.811 Mean :449861
## 3rd Qu.:9.030 3rd Qu.:8.260 3rd Qu.:500248
## Max. :9.635 Max. :9.400 Max. :567784
##
## Time_taken Twitter_hastags Genre Avg_age_actors
## Min. : 0.0 Min. : 201.2 Length:506 Min. : 3.00
## 1st Qu.:132.3 1st Qu.: 223.8 Class :character 1st Qu.:28.00
## Median :160.0 Median : 254.4 Mode :character Median :39.00
## Mean :157.4 Mean : 260.8 Mean :39.18
## 3rd Qu.:181.9 3rd Qu.: 283.4 3rd Qu.:50.00
## Max. :217.5 Max. :2022.4 Max. :60.00
## NA's :12
## Num_multiplex Collection Start_Tech_Oscar
## Min. :333.0 Min. : 10000 Min. :0.0000
## 1st Qu.:465.0 1st Qu.: 34050 1st Qu.:0.0000
## Median :535.5 Median : 42400 Median :1.0000
## Mean :545.0 Mean : 45058 Mean :0.5455
## 3rd Qu.:614.8 3rd Qu.: 50000 3rd Qu.:1.0000
## Max. :868.0 Max. :100000 Max. :1.0000
##
# Mengganti nilai missing di 'Time_taken' dengan median
data$Time_taken[is.na(data$Time_taken)] <- median(data$Time_taken, na.rm = TRUE)
# Encoding variabel kategorikal 'X3D_available' dan 'Genre'
data$X3D_available <- as.factor(ifelse(data$X3D_available == "YES", 1, 0))
data$Genre <- as.factor(data$Genre)
#Buat Plot
ggplot(data, aes(x = Collection)) + geom_histogram(binwidth = 5000, fill = "blue", color = "black") +
theme_minimal() + labs(title = "Distribusi Koleksi")
#Buat Corrplot
numeric_data <- data[sapply(data, is.numeric)]
corr_matrix <- cor(numeric_data)
corrplot(corr_matrix, method = "color")
# Memilih hanya kolom numerik dari data frame
numeric_data <- data[, sapply(data, is.numeric)]
# Menampilkan matriks korelasi
correlation_matrix <- cor(numeric_data, use = "complete.obs")
print(correlation_matrix)
## Marketing.expense Production.expense Multiplex.coverage
## Marketing.expense 1.00000000 0.4065834114 -0.420971711
## Production.expense 0.40658341 1.0000000000 -0.763651447
## Multiplex.coverage -0.42097171 -0.7636514469 1.000000000
## Budget -0.21924670 -0.3916758527 0.302188188
## Movie_length 0.35273425 0.6447785114 -0.731470104
## Lead_.Actor_Rating 0.38005024 0.7064813315 -0.768589016
## Lead_Actress_rating 0.37981329 0.7079564568 -0.769724000
## Director_rating 0.38006890 0.7075663557 -0.769156999
## Producer_rating 0.37646167 0.7058188780 -0.764873344
## Critic_rating -0.18498503 -0.2515646606 0.145555231
## Trailer_views -0.44345715 -0.5916570937 0.581386123
## Time_taken 0.02610779 0.0157361720 0.034891155
## Twitter_hastags 0.01351787 -0.0008386304 0.004881587
## Avg_age_actors 0.05920392 0.0558098001 -0.092103994
## Num_multiplex 0.38329769 0.7075585134 -0.915494613
## Collection -0.38958244 -0.4847543793 0.429300219
## Start_Tech_Oscar -0.01341676 -0.0244044234 -0.004017022
## Budget Movie_length Lead_.Actor_Rating
## Marketing.expense -0.21924670 0.352734251 0.38005024
## Production.expense -0.39167585 0.644778511 0.70648133
## Multiplex.coverage 0.30218819 -0.731470104 -0.76858902
## Budget 1.00000000 -0.240264931 -0.20846434
## Movie_length -0.24026493 1.000000000 0.74690431
## Lead_.Actor_Rating -0.20846434 0.746904308 1.00000000
## Lead_Actress_rating -0.20398069 0.746493103 0.99790453
## Director_rating -0.20190700 0.747020751 0.99773453
## Producer_rating -0.20539683 0.746706615 0.99407321
## Critic_rating 0.23236071 -0.217830158 -0.16997774
## Trailer_views 0.60253604 -0.589317762 -0.49026692
## Time_taken 0.03969711 -0.019136275 0.03852357
## Twitter_hastags 0.03067351 0.009379579 0.01446283
## Avg_age_actors -0.06469390 0.075197765 0.03679436
## Num_multiplex -0.28279597 0.673895783 0.70633064
## Collection 0.69630379 -0.377998896 -0.25135494
## Start_Tech_Oscar -0.02714765 0.016291411 -0.03530851
## Lead_Actress_rating Director_rating Producer_rating
## Marketing.expense 0.37981329 0.38006890 0.376461674
## Production.expense 0.70795646 0.70756636 0.705818878
## Multiplex.coverage -0.76972400 -0.76915700 -0.764873344
## Budget -0.20398069 -0.20190700 -0.205396835
## Movie_length 0.74649310 0.74702075 0.746706615
## Lead_.Actor_Rating 0.99790453 0.99773453 0.994073208
## Lead_Actress_rating 1.00000000 0.99809682 0.994002603
## Director_rating 0.99809682 1.00000000 0.994125524
## Producer_rating 0.99400260 0.99412552 1.000000000
## Critic_rating -0.16599207 -0.16663786 -0.167003414
## Trailer_views -0.48753605 -0.48645183 -0.487910743
## Time_taken 0.03840683 0.03629349 0.029083451
## Twitter_hastags 0.01023920 0.01007678 0.005849935
## Avg_age_actors 0.03800505 0.04146959 0.032542133
## Num_multiplex 0.70825662 0.70936354 0.703517871
## Collection -0.24945945 -0.24665004 -0.248200181
## Start_Tech_Oscar -0.04035558 -0.03576832 -0.043612169
## Critic_rating Trailer_views Time_taken Twitter_hastags
## Marketing.expense -0.184985034 -0.443457148 0.026107787 0.0135178715
## Production.expense -0.251564661 -0.591657094 0.015736172 -0.0008386304
## Multiplex.coverage 0.145555231 0.581386123 0.034891155 0.0048815872
## Budget 0.232360714 0.602536042 0.039697111 0.0306735090
## Movie_length -0.217830158 -0.589317762 -0.019136275 0.0093795793
## Lead_.Actor_Rating -0.169977738 -0.490266915 0.038523569 0.0144628347
## Lead_Actress_rating -0.165992070 -0.487536047 0.038406833 0.0102391975
## Director_rating -0.166637855 -0.486451834 0.036293493 0.0100767796
## Producer_rating -0.167003414 -0.487910743 0.029083451 0.0058499350
## Critic_rating 1.000000000 0.228641376 -0.015897790 -0.0236552818
## Trailer_views 0.228641376 1.000000000 0.073588662 -0.0067042514
## Time_taken -0.015897790 0.073588662 1.000000000 -0.0065635530
## Twitter_hastags -0.023655282 -0.006704251 -0.006563553 1.0000000000
## Avg_age_actors -0.049797149 -0.049725754 0.071199513 -0.0048398133
## Num_multiplex -0.128769284 -0.544099603 -0.056093462 0.0062551706
## Collection 0.341288410 0.720119419 0.109127690 0.0231221267
## Start_Tech_Oscar -0.001084021 -0.075782817 -0.063930342 0.0773331176
## Avg_age_actors Num_multiplex Collection Start_Tech_Oscar
## Marketing.expense 0.059203919 0.383297685 -0.38958244 -0.013416763
## Production.expense 0.055809800 0.707558513 -0.48475438 -0.024404423
## Multiplex.coverage -0.092103994 -0.915494613 0.42930022 -0.004017022
## Budget -0.064693897 -0.282795967 0.69630379 -0.027147654
## Movie_length 0.075197765 0.673895783 -0.37799890 0.016291411
## Lead_.Actor_Rating 0.036794363 0.706330636 -0.25135494 -0.035308511
## Lead_Actress_rating 0.038005053 0.708256617 -0.24945945 -0.040355577
## Director_rating 0.041469589 0.709363544 -0.24665004 -0.035768322
## Producer_rating 0.032542133 0.703517871 -0.24820018 -0.043612169
## Critic_rating -0.049797149 -0.128769284 0.34128841 -0.001084021
## Trailer_views -0.049725754 -0.544099603 0.72011942 -0.075782817
## Time_taken 0.071199513 -0.056093462 0.10912769 -0.063930342
## Twitter_hastags -0.004839813 0.006255171 0.02312213 0.077333118
## Avg_age_actors 1.000000000 0.078811018 -0.04742623 0.040580566
## Num_multiplex 0.078811018 1.000000000 -0.39172857 -0.004857210
## Collection -0.047426233 -0.391728569 1.00000000 0.154698364
## Start_Tech_Oscar 0.040580566 -0.004857210 0.15469836 1.000000000
Positive Correlation Lead Actor Rating, Lead Actress Rating, Director Rating dan Producer Rating memiliki korelasi yang positif (sekitar 0.99), menunjukkan rating yang tinggi di salah satu variabel akan memiliki rating yang tinggi juga di variabel yang lain. Korelasi antara production Expense dengan Movie Length (0.64) menunjukkan film yang berdurasi lebih panjang memiliki biaya produksi yang lebih tinggi. Collection dengan Budget (0.70) menunjukkan film dengan budget lebih tinggi meningkatkan kemungkinan film tersebut menghasilkan collection box office yang lebih tinggi.
Negative Correlation Multiplex Coverage dan Movie Length. Film berdurasi panjang sering menunjukkan multiplexes yang lebih sedikit.
# Buat variable high collection berdasarkan mediannya agar bisa dibuat menjadi logistic regression
data$high_collection <- ifelse(data$Collection > median(data$Collection, na.rm = TRUE), 1, 0)
str(data)
## 'data.frame': 506 obs. of 20 variables:
## $ Marketing.expense : num 20.1 20.5 20.5 20.6 21.4 ...
## $ Production.expense : num 59.6 69.1 69.1 59.4 59.4 ...
## $ Multiplex.coverage : num 0.462 0.531 0.531 0.542 0.542 0.542 0.476 0.476 0.476 0.476 ...
## $ Budget : num 36524 35669 39913 38874 39702 ...
## $ Movie_length : num 139 152 135 119 128 ...
## $ Lead_.Actor_Rating : num 7.83 7.5 7.49 6.89 6.92 ...
## $ Lead_Actress_rating: num 8.1 7.65 7.57 7.04 7.07 ...
## $ Director_rating : num 7.91 7.44 7.5 6.92 6.82 ...
## $ Producer_rating : num 8 7.47 7.51 7.02 7.07 ...
## $ Critic_rating : num 7.94 7.44 7.44 8.26 8.26 7.26 8.96 7.96 7.96 7.96 ...
## $ Trailer_views : int 527367 494055 547051 516279 531448 498425 459241 400821 295168 412012 ...
## $ X3D_available : Factor w/ 2 levels "0","1": 2 1 1 2 1 2 2 1 2 2 ...
## $ Time_taken : num 110 147 148 185 176 ...
## $ Twitter_hastags : num 224 243 2022 225 226 ...
## $ Genre : Factor w/ 4 levels "Action","Comedy",..: 4 3 2 3 3 2 4 3 2 4 ...
## $ Avg_age_actors : int 23 42 38 45 55 53 41 56 55 45 ...
## $ Num_multiplex : int 494 462 458 472 395 460 522 571 564 508 ...
## $ Collection : int 48000 43200 69400 66800 72400 57400 45800 44200 33000 37800 ...
## $ Start_Tech_Oscar : int 1 0 1 1 1 0 0 0 1 1 ...
## $ high_collection : num 1 1 1 1 1 1 1 1 0 0 ...
# Set seed
set.seed(88)
# Split data into training (80%) and test (20%) sets
split <- sample.split(data$high_collection, SplitRatio = 0.8)
train_data <- subset(data, split == TRUE)
test_data <- subset(data, split == FALSE)
Buat Data Training dan Data Testing 80-20
# Fit logistic regression model
model_all <- glm(high_collection ~ .,
data = train_data, family = "binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model_all)
##
## Call:
## glm(formula = high_collection ~ ., family = "binomial", data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.247e+03 3.263e+05 -0.004 0.997
## Marketing.expense 5.561e-03 9.649e+01 0.000 1.000
## Production.expense 9.144e-01 4.339e+02 0.002 0.998
## Multiplex.coverage 6.490e+00 1.280e+05 0.000 1.000
## Budget -1.859e-03 2.791e+00 -0.001 0.999
## Movie_length 4.811e-01 4.939e+02 0.001 0.999
## Lead_.Actor_Rating 1.160e+01 8.660e+04 0.000 1.000
## Lead_Actress_rating -2.381e+01 1.531e+05 0.000 1.000
## Director_rating 3.753e+01 2.165e+05 0.000 1.000
## Producer_rating -3.738e+01 7.343e+04 -0.001 1.000
## Critic_rating 1.673e+00 1.921e+04 0.000 1.000
## Trailer_views 1.620e-04 2.487e-01 0.001 0.999
## X3D_available1 8.518e+00 1.910e+04 0.000 1.000
## Time_taken -2.169e-01 2.693e+02 -0.001 0.999
## Twitter_hastags -1.897e-01 2.542e+02 -0.001 0.999
## GenreComedy -1.003e+01 2.691e+04 0.000 1.000
## GenreDrama -1.825e+01 2.045e+04 -0.001 0.999
## GenreThriller -1.735e+01 1.407e+04 -0.001 0.999
## Avg_age_actors -1.297e-01 7.331e+02 0.000 1.000
## Num_multiplex -1.963e-02 2.630e+02 0.000 1.000
## Collection 3.038e-02 2.547e+00 0.012 0.990
## Start_Tech_Oscar 1.630e+00 2.136e+04 0.000 1.000
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5.6139e+02 on 404 degrees of freedom
## Residual deviance: 3.9995e-07 on 383 degrees of freedom
## AIC: 44
##
## Number of Fisher Scoring iterations: 25
# Fit logistic regression model
model <- glm(high_collection ~ Budget + Marketing.expense + Production.expense +
Multiplex.coverage + Movie_length + Lead_.Actor_Rating +
Lead_Actress_rating + Director_rating + Producer_rating +
Critic_rating,
data = train_data, family = "binomial")
summary(model)
##
## Call:
## glm(formula = high_collection ~ Budget + Marketing.expense +
## Production.expense + Multiplex.coverage + Movie_length +
## Lead_.Actor_Rating + Lead_Actress_rating + Director_rating +
## Producer_rating + Critic_rating, family = "binomial", data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.383e+01 4.630e+00 -5.146 2.66e-07 ***
## Budget 5.610e-04 7.918e-05 7.085 1.39e-12 ***
## Marketing.expense -9.657e-03 2.796e-03 -3.454 0.000552 ***
## Production.expense -7.115e-04 1.769e-02 -0.040 0.967910
## Multiplex.coverage 1.640e+00 2.459e+00 0.667 0.504907
## Movie_length -5.119e-02 9.546e-03 -5.362 8.22e-08 ***
## Lead_.Actor_Rating 3.206e+00 2.465e+00 1.301 0.193339
## Lead_Actress_rating -5.922e+00 2.614e+00 -2.265 0.023495 *
## Director_rating 3.296e+00 2.552e+00 1.291 0.196607
## Producer_rating 3.719e-01 1.349e+00 0.276 0.782791
## Critic_rating 6.264e-01 2.489e-01 2.517 0.011839 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 561.39 on 404 degrees of freedom
## Residual deviance: 291.88 on 394 degrees of freedom
## AIC: 313.88
##
## Number of Fisher Scoring iterations: 7
Observasi model_all Coefficients : Banyak coefficient yang memiliki error yang tinggi dan nilai p-value yang tidak significant. Menunjukkan fitur yang tidak berelasi kuat dengan variabel target. Predictors : tidak ada prediktor yang menunjukkan hasil yang tinggi (semua p-values > 0.05) Deviance : nilai null deviance 561.38 dan residual deviance 0.00000039995 mengindikasikan overfitting
Observasi model Model ini memiliki coefficient dan standar error yang lebih realistis. Beberapa prediktor menunjukkan hasil yang baik seperti Budget (3bintang), Marketing.expense(3bintang), Movie_length(3bintang), Lead_Actress_rating(1bintang), dan Critic_rating(1bintang) yang menunjukkan variabel memiliki relasi yang baik dengan variabel target.
# Predict test_data
predicted_probabilities_test <- predict(model, newdata = test_data, type = "response")
predicted_probabilities_test
## 5 6 14 15 17 18
## 9.583902e-01 4.965116e-01 2.708238e-01 2.161229e-01 7.995395e-01 1.297925e-01
## 19 31 33 48 52 55
## 2.515332e-01 3.520443e-02 1.987650e-01 1.773051e-01 4.163235e-01 2.073448e-01
## 62 63 64 65 71 82
## 3.443205e-02 2.663007e-01 7.729818e-01 7.264012e-01 9.430523e-01 4.464873e-01
## 83 86 87 95 97 102
## 9.036370e-01 8.849264e-01 5.710478e-01 4.996169e-01 7.661037e-01 9.096300e-01
## 105 114 119 127 128 132
## 2.629064e-01 3.929880e-01 2.947193e-01 1.316618e-01 9.656462e-02 1.928969e-01
## 135 136 142 144 150 157
## 8.620261e-02 3.866147e-01 1.250611e-02 3.859681e-02 4.233765e-02 3.382130e-02
## 172 173 180 182 186 189
## 1.790707e-01 1.629269e-01 9.808972e-01 7.001760e-01 5.056124e-01 9.788240e-01
## 192 194 197 199 205 207
## 9.543148e-01 9.875928e-01 9.883949e-01 9.878297e-01 9.991040e-01 7.454044e-01
## 209 218 223 227 229 235
## 3.987894e-01 6.678907e-01 9.203462e-01 9.926902e-01 9.997826e-01 9.024851e-01
## 236 237 243 250 252 254
## 6.074098e-01 6.431499e-01 6.829694e-01 9.300475e-01 9.184304e-01 9.995266e-01
## 255 271 280 282 287 299
## 3.768803e-01 4.813464e-01 9.914594e-01 9.743895e-01 5.611726e-01 9.101849e-01
## 303 307 315 317 327 335
## 9.593827e-01 9.816081e-01 3.945587e-01 1.497358e-01 7.271344e-01 6.599393e-01
## 336 338 352 356 367 368
## 3.403899e-01 3.280902e-01 3.417480e-01 1.955951e-01 7.464864e-03 6.847839e-05
## 369 374 375 377 380 383
## 5.998306e-03 1.510973e-03 2.653916e-05 1.607898e-01 2.390535e-02 1.821117e-02
## 390 393 399 418 422 429
## 7.140381e-03 1.635269e-03 1.711045e-05 2.185847e-04 8.323154e-02 3.936993e-01
## 431 436 462 464 466 468
## 1.383401e-01 3.088634e-01 2.876975e-01 2.174922e-01 2.034058e-01 7.728164e-02
## 474 483 486 493 501
## 8.741133e-01 8.091946e-01 7.574132e-01 2.496067e-01 3.931975e-01
# Klasifikasi prediksi 1 (high collection) atau 0 (low collection) menggunakan threshold
predicted_class_test <- ifelse(predicted_probabilities_test > 0.5, 1, 0)
predicted_class_test
## 5 6 14 15 17 18 19 31 33 48 52 55 62 63 64 65 71 82 83 86
## 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 1 0 1 1
## 87 95 97 102 105 114 119 127 128 132 135 136 142 144 150 157 172 173 180 182
## 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
## 186 189 192 194 197 199 205 207 209 218 223 227 229 235 236 237 243 250 252 254
## 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1
## 255 271 280 282 287 299 303 307 315 317 327 335 336 338 352 356 367 368 369 374
## 0 0 1 1 1 1 1 1 0 0 1 1 0 0 0 0 0 0 0 0
## 375 377 380 383 390 393 399 418 422 429 431 436 462 464 466 468 474 483 486 493
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0
## 501
## 0
# Membandingkan predicted classes dengan actual classes
conf_matrix <- table(Predicted = predicted_class_test, Actual = test_data$high_collection)
conf_matrix
## Actual
## Predicted 0 1
## 0 48 11
## 1 3 39
# Kalkulasi akurasi
accuracy <- sum(predicted_class_test == test_data$high_collection) / nrow(test_data)
print(paste("Accuracy:", accuracy))
## [1] "Accuracy: 0.861386138613861"
# Confusion Matrix
confusionMatrix(factor(predicted_class_test), factor(test_data$high_collection))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 48 11
## 1 3 39
##
## Accuracy : 0.8614
## 95% CI : (0.7784, 0.9221)
## No Information Rate : 0.505
## P-Value [Acc > NIR] : 4.937e-14
##
## Kappa : 0.7223
##
## Mcnemar's Test P-Value : 0.06137
##
## Sensitivity : 0.9412
## Specificity : 0.7800
## Pos Pred Value : 0.8136
## Neg Pred Value : 0.9286
## Prevalence : 0.5050
## Detection Rate : 0.4752
## Detection Prevalence : 0.5842
## Balanced Accuracy : 0.8606
##
## 'Positive' Class : 0
##
Kesimpulan Evaluasi Model : Accuracy : Model mencapai akurasi 86.14% . Ini merupakan indikator performa model cukup dapat diandalkan untuk memprediksi apakah suatu film akan memiliki collection yang tinggi atau rendah
95% Confidence Interval: range (0.7784, 0.9221) menunjukkan akurasi model signifikan secara statistik dan konsisten saat diuji dengan data test
Confusion Matrix: True Positives (48): Model berhasil memprediksi 48 instance di mana collection film rendah (0) dengan benar. True Negatives (39): Model berhasil memprediksi 39 instance di mana collection film tinggi (1) dengan benar. False Positives (3): Model salah memprediksi 3 instance sebagai collection tinggi padahal sebenarnya rendah. False Negatives (11): Model salah memprediksi 11 instance sebagai collection rendah padahal sebenarnya tinggi.
Sensitivitas dan Spesifisitas: Sensitivitas (Recall untuk Kelas 0): 0.9412 menunjukkan bahwa model berhasil mendeteksi 94.12% dari kelas 0 aktual (collection rendah). Artinya, model sangat efektif dalam mendeteksi kasus collection rendah. Spesifisitas (Recall untuk Kelas 1): 0.7800 menunjukkan bahwa model berhasil mendeteksi 78% dari kelas 1 aktual (collection tinggi). Meskipun lebih rendah dari sensitivitas, ini masih menunjukkan kemampuan yang baik dalam mendeteksi kasus collection tinggi. 4. Nilai Prediktif Positif dan Negatif: Nilai Prediktif Positif (Presisi untuk Kelas 0): 0.8136 menunjukkan bahwa ketika model memprediksi 0, prediksinya benar sebesar 81.36% dari waktu. Nilai Prediktif Negatif (Presisi untuk Kelas 1): 0.9286 menunjukkan bahwa ketika model memprediksi 1, prediksinya benar sebesar 92.86% dari waktu.
Akurasi Seimbang: Akurasi Seimbang: 0.8606 merepresentasikan rata-rata dari sensitivitas dan spesifisitas. Metrik ini penting ketika dataset mungkin tidak memiliki representasi kelas yang seimbang, menunjukkan bahwa model berkinerja baik untuk kedua kelas positif dan negatif.
False Negative relatif lebih tinggi (11 kasus) dibandingkan dengan false positive (3 kasus). Ini mungkin menunjukkan bahwa model cenderung melewatkan kasus collection tinggi, yang mungkin memerlukan penyesuaian jika mendeteksi kasus ini sangat penting.