Predicting the Revenue of Korean Films Entering the US Market

We used Python to build datasets of US and Korean films from the IMDb site.

setwd("C:/NCS/Rwork_II/R-script/project")
movie_USA <- read.csv('movie_metadata_USA.csv', header = T, stringsAsFactors = F)
str(movie_USA)
## 'data.frame':    5151 obs. of  28 variables:
##  $ color                    : chr  "Color" "Color" "Color" "Color" ...
##  $ aspect_ratio             : num  1.78 2.35 2.35 2.35 2.35 2.35 2.35 1.85 2.35 2.35 ...
##  $ movie_title              : chr  "Avatar" "Spectre" "The Dark Knight Rises" "Star Wars: The Force Awakens" ...
##  $ movie_facebook_likes     : int  34000 88000 160000 260000 5277 24000 48000 29000 3813 110000 ...
##  $ title_year               : int  2009 2015 2012 2015 2007 2012 2013 2010 2007 2015 ...
##  $ country                  : chr  "UK" "UK" "UK" "USA" ...
##  $ language                 : chr  "English" "English" "English" "English" ...
##  $ content_rating           : chr  "PG-13" "PG-13" "PG-13" "PG-13" ...
##  $ genres                   : chr  "Action|Adventure|Fantasy|Sci-Fi" "Action|Adventure|Thriller" "Action|Thriller" "Action|Adventure|Fantasy|Sci-Fi" ...
##  $ duration                 : int  162 148 164 136 169 132 150 100 139 141 ...
##  $ director_name            : chr  "James Cameron" "Sam Mendes" "Christopher Nolan" "J.J. Abrams" ...
##  $ director_facebook_likes  : int  7006 2214 22000 14000 575 476 575 17 1322 5324 ...
##  $ actor_1_name             : chr  "Sam Worthington" "Daniel Craig" "Christian Bale" "Harrison Ford" ...
##  $ actor_1_facebook_likes   : int  5334 8520 23000 11000 40000 5626 40000 2170 2880 21000 ...
##  $ actor_2_name             : chr  "Zoe Saldana" "Christoph Waltz" "Gary Oldman" "Mark Hamill" ...
##  $ actor_2_facebook_likes   : int  8219 11000 10000 4533 2944 1825 5494 3682 4095 26000 ...
##  $ actor_3_name             : chr  "Sigourney Weaver" "Léa Seydoux" "Tom Hardy" "Carrie Fisher" ...
##  $ actor_3_facebook_likes   : int  4001 1749 27000 4851 5059 650 6038 564 11000 6638 ...
##  $ total_cast_facebook_likes: int  36445 45912 128950 45219 73582 29999 67543 20644 64500 133086 ...
##  $ imdb_score               : num  7.8 6.8 8.5 8.1 7.1 6.6 6.5 7.8 6.2 7.4 ...
##  $ num_voted_users          : num  931429 307058 1216414 655296 495970 ...
##  $ num_critic_for_reviews   : int  722 630 815 831 303 466 459 327 393 654 ...
##  $ num_user_for_reviews     : int  3082 1032 2730 4117 1243 756 722 398 1908 1130 ...
##  $ gross                    : int  760505847 200074175 448130642 936627416 309404152 73058679 89289910 200807262 336530303 458991599 ...
##  $ facenumber_in_poster     : int  0 1 0 3 0 1 1 1 0 4 ...
##  $ plot_keywords            : chr  "avatar|future|marine|native|paraplegic" "bomb|espionage|sequel|spy|terrorist" "female villain|mysterious woman|suspense|woman fights a man|written by director" "droid|father son relationship|female protagonist|outer space|patricide" ...
##  $ budget                   : num  2.37e+08 2.45e+08 2.50e+08 2.45e+08 3.00e+08 2.50e+08 2.15e+08 2.60e+08 2.58e+08 2.50e+08 ...
##  $ storyline                : chr  "When his brother is killed in a robbery, paraplegic Marine Jake Sully decides to take his place in a mission on the distant wor"| __truncated__ "A cryptic message from the past sends James Bond on a rogue mission to Mexico City and eventually Rome, where he meets Lucia, t"| __truncated__ "Despite his tarnished reputation after the events of The Dark Knight, in which he took the rap for Dent's crimes, Batman feels "| __truncated__ "30 years after the defeat of Darth Vader and the Empire, Rey, a scavenger from the planet Jakku, finds a BB-8 droid that knows "| __truncated__ ...
head(movie_USA, 3)
##   color aspect_ratio           movie_title movie_facebook_likes title_year
## 1 Color         1.78                Avatar                34000       2009
## 2 Color         2.35               Spectre                88000       2015
## 3 Color         2.35 The Dark Knight Rises               160000       2012
##   country language content_rating                          genres duration
## 1      UK  English          PG-13 Action|Adventure|Fantasy|Sci-Fi      162
## 2      UK  English          PG-13       Action|Adventure|Thriller      148
## 3      UK  English          PG-13                 Action|Thriller      164
##       director_name director_facebook_likes    actor_1_name
## 1     James Cameron                    7006 Sam Worthington
## 2        Sam Mendes                    2214    Daniel Craig
## 3 Christopher Nolan                   22000  Christian Bale
##   actor_1_facebook_likes    actor_2_name actor_2_facebook_likes
## 1                   5334     Zoe Saldana                   8219
## 2                   8520 Christoph Waltz                  11000
## 3                  23000     Gary Oldman                  10000
##       actor_3_name actor_3_facebook_likes total_cast_facebook_likes
## 1 Sigourney Weaver                   4001                     36445
## 2     Léa Seydoux                   1749                     45912
## 3        Tom Hardy                  27000                    128950
##   imdb_score num_voted_users num_critic_for_reviews num_user_for_reviews
## 1        7.8          931429                    722                 3082
## 2        6.8          307058                    630                 1032
## 3        8.5         1216414                    815                 2730
##       gross facenumber_in_poster
## 1 760505847                    0
## 2 200074175                    1
## 3 448130642                    0
##                                                                     plot_keywords
## 1                                          avatar|future|marine|native|paraplegic
## 2                                             bomb|espionage|sequel|spy|terrorist
## 3 female villain|mysterious woman|suspense|woman fights a man|written by director
##     budget
## 1 2.37e+08
## 2 2.45e+08
## 3 2.50e+08
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 storyline
## 1 When his brother is killed in a robbery, paraplegic Marine Jake Sully decides to take his place in a mission on the distant world of Pandora. There he learns of greedy corporate figurehead Parker Selfridge\\'s intentions of driving off the native humanoid "Na\\'vi" in order to mine for the precious material scattered throughout their rich woodland. In exchange for the spinal surgery that will fix his legs, Jake gathers intel for the cooperating military unit spearheaded by gung-ho Colonel Quaritch, while simultaneously attempting to infiltrate the Na\\'vi people with the use of an "avatar" identity. While Jake begins to bond with the native tribe and quickly falls in love with the beautiful alien Neytiri, the restless Colonel moves forward with his ruthless extermination tactics, forcing the soldier to take a stand - and fight back in an epic battle for the fate of Pandora.'
## 2                                                           A cryptic message from the past sends James Bond on a rogue mission to Mexico City and eventually Rome, where he meets Lucia, the beautiful and forbidden widow of an infamous criminal. Bond infiltrates a secret meeting and uncovers the existence of the sinister organisation known as SPECTRE. Meanwhile back in London, Max Denbigh, the new head of the Centre of National Security, questions Bond's actions and challenges the relevance of MI6 led by M. Bond covertly enlists Moneypenny and Q to help him seek out Madeleine Swann, the daughter of his old nemesis Mr White, who may hold the clue to untangling the web of SPECTRE. As the daughter of the assassin, she understands Bond in a way most others cannot. As Bond ventures towards the heart of SPECTRE, he learns a chilling connection between himself and the enemy he seeks."
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       Despite his tarnished reputation after the events of The Dark Knight, in which he took the rap for Dent's crimes, Batman feels compelled to intervene to assist the city and its police force which is struggling to cope with Bane's plans to destroy the city."
dim(movie_USA)
## [1] 5151   28

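1. Preprocessing

Building a revenue (gross) prediction model from the movie_USA dataset requires some preprocessing first.

(1) Adding title_length

To see whether the length of a film's title is related to its revenue, we created title length as an additional variable.

We use the stringr package to compute the title lengths.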

library(stringr)
movie_USA$title_length <- str_length(movie_USA$movie_title)

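(2) Handling NAs

Replace every NA in movie_USA with 0, then drop the rows whose gross is 0, so that films with a missing gross are excluded.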

movie_USA[is.na(movie_USA)] <- 0
movie_USA <- subset(movie_USA, movie_USA$gross != 0)
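
The effect of this two-step filter can be checked on a toy data frame. The sketch below uses made-up values (`toy` and `toy_kept` are hypothetical names, not part of the dataset). It suggests that rows with a missing gross and rows whose gross was already 0 are both dropped, while NAs in other numeric columns (here `budget`) stay in the data as genuine zeros.

```r
# Toy data frame with made-up values (not taken from movie_USA)
toy <- data.frame(
  movie_title = c("A", "B", "C", "D"),
  gross       = c(100, NA, 0, 250),
  budget      = c(50, 80, NA, NA),
  stringsAsFactors = FALSE
)

# Same two steps as in the text: NA -> 0, then keep rows with non-zero gross
toy[is.na(toy)] <- 0
toy_kept <- subset(toy, gross != 0)
print(toy_kept)
```

One consequence worth keeping in mind: a zero produced this way is indistinguishable from a real zero in later means and correlations.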

(3) Grouping gross by country, language, and content rating

Grouping and summarising by each nominal variable requires the dplyr package.

library(dplyr)
# Mean gross by content rating
content <- group_by(movie_USA, content_rating)
content_gross <- summarise(content, count=n(), gross=mean(gross, na.rm=T))

# Mean gross by language
language <- group_by(movie_USA, language)
language_gross <- summarise(language, count=n(), gross=mean(gross, na.rm=T))

# Mean gross by country
country <- group_by(movie_USA, country)
country_gross <- summarise(country, count=n(), gross=mean(gross, na.rm=T))
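
To make the shape of these summary tables concrete, here is a minimal sketch on made-up data. It swaps in base R (`table` and `tapply`) for dplyr so that it runs without extra packages; `toy` and `content_gross_toy` are hypothetical names.

```r
# Toy data with made-up values
toy <- data.frame(
  content_rating = c("PG-13", "PG-13", "R", "R", "R"),
  gross          = c(100, 300, 50, 150, 100)
)

# Per-group row count and mean gross, the same two columns summarise() produces
rating_count <- table(toy$content_rating)
rating_mean  <- tapply(toy$gross, toy$content_rating, mean, na.rm = TRUE)

content_gross_toy <- data.frame(
  content_rating = names(rating_mean),
  count          = as.integer(rating_count[names(rating_mean)]),
  gross          = as.numeric(rating_mean)
)
print(content_gross_toy)
```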

(4) Director name => director grade

To use the director name (a nominal variable) in a regression model, it must be converted to a numeric (ratio-scale) value.

To build the director grade, we grouped the data by director name.

We then computed summary statistics per director: number of films, director Facebook likes, film Facebook likes, total gross, and IMDb score.

# Build the director grade
# (1) group_by director name
director <- group_by(movie_USA, director_name)

director_grade <- summarise(director, count = n(),
                            f_b_like = mean(director_facebook_likes, na.rm=T),
                            movie_fb = sum(movie_facebook_likes, na.rm=T),
                            gross = sum(as.double(gross), na.rm=T),
                            imdb = mean(imdb_score, na.rm = T)
)
(4-1)

After converting the result to a data frame, we removed the row for directors with no name.

director_grade <- as.data.frame(director_grade)
head(director_grade,3)
##     director_name count f_b_like movie_fb    gross     imdb
## 1                     3        0    10285   343838 8.633333
## 2 Aaron Schneider     1       15     8876  9176553 7.100000
## 3   Aaron Seltzer     1       64      815 48546578 2.700000
director_grade <- director_grade[-1, ] # drop the first row, which holds the films with no director name
(4-2)

Each of these values needs to be standardized (converted to a z-score).

## Create vectors ##
normal <- numeric()
facebook <- numeric()
movie_fb <- numeric()
gross_n <- numeric()
imdb_n <- numeric()

## Assign values ##
normal <- (director_grade$count-mean(director_grade$count))/sd(director_grade$count)
facebook <- (director_grade$f_b_like - mean(director_grade$f_b_like))/sd(director_grade$f_b_like)
movie_fb <- (director_grade$movie_fb - mean(director_grade$movie_fb))/sd(director_grade$movie_fb)
imdb_n <- (director_grade$imdb - mean(director_grade$imdb))/sd(director_grade$imdb)
gross_n <- (director_grade$gross - mean(director_grade$gross))/sd(director_grade$gross)
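
The manual formula above is the usual z-score. As a small check on made-up numbers, the sketch below compares it with base R's `scale()`, which centers and divides by the standard deviation by default; `x`, `z_manual`, and `z_scale` are illustrative names only.

```r
# Made-up values standing in for one summary column
x <- c(1, 2, 3, 4, 10)

# Manual z-score, as in the text
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; flatten it for comparison
z_scale <- as.numeric(scale(x))

print(all.equal(z_manual, z_scale))
print(c(mean = mean(z_manual), sd = sd(z_manual)))
```

Either form should give a column with mean 0 and standard deviation 1, which is what lets the weighted sum in the next step mix quantities measured on very different scales.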
(4-3)

We then weighted the standardized values and assigned each director a grade.

library(plotly)
## Grading ##
## Weighted sum of the standardized values: number of films (0.2), director Facebook likes (0.1),
## film Facebook likes (0.1), gross (0.3), IMDb score (0.3) --> director grade
director_grade$grade <- normal*0.2 + facebook*0.1 + movie_fb*0.1 + gross_n*0.3 + imdb_n*0.3

plot_ly(director_grade, x = ~grade, type='histogram') %>%
  layout(title = 'Distribution of director grades', yaxis = list(side='left', title='Frequency'))
director_grade$grade2[director_grade$grade > 2] <- 5
director_grade$grade2[director_grade$grade > 1 & director_grade$grade <= 2] <- 4
director_grade$grade2[director_grade$grade > 0 & director_grade$grade <= 1] <- 3
director_grade$grade2[director_grade$grade > -1 & director_grade$grade <= 0] <- 2
director_grade$grade2[director_grade$grade <= -1] <- 1
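
The five threshold assignments can also be written as a single binning step. The following is a minimal sketch on made-up grade values, assuming the same cut points (-1, 0, 1, 2) with right-closed intervals; `grade` and `grade2` here are standalone vectors, not the columns of director_grade.

```r
# Made-up composite grades, including values exactly on the boundaries
grade <- c(-1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2, 2.5)

# cut() uses right-closed intervals (a, b] by default, which matches
# the "> lower & <= upper" conditions in the text
grade2 <- cut(grade, breaks = c(-Inf, -1, 0, 1, 2, Inf), labels = FALSE)

print(data.frame(grade, grade2))
```

With `labels = FALSE`, `cut()` returns the bin number directly, so boundary values such as -1 and 2 land in the lower bin, as they do in the original assignments.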

(5) Actor name => actor grade

To use the names of the three lead and supporting actors (nominal variables) in a regression model, they must be converted to numeric (ratio-scale) values.

To build the actor grade, we grouped the data by actor name.

We then computed summary statistics per actor: actor Facebook likes, film Facebook likes, total gross, and IMDb score.

# Actor grade
# (1) Extract the actors
actor1 <- movie_USA[,c('actor_1_name', 'actor_1_facebook_likes', 'imdb_score','movie_facebook_likes','gross')]
actor2 <- movie_USA[,c('actor_2_name', 'actor_2_facebook_likes', 'imdb_score','movie_facebook_likes','gross')]
actor3 <- movie_USA[,c('actor_3_name', 'actor_3_facebook_likes', 'imdb_score','movie_facebook_likes','gross')]

names(actor1) <- c('actor_name', 'actor_facebook_likes', 'imdb_score','movie_facebook_likes','gross')
names(actor2) <- c('actor_name', 'actor_facebook_likes', 'imdb_score','movie_facebook_likes','gross')
names(actor3) <- c('actor_name', 'actor_facebook_likes', 'imdb_score','movie_facebook_likes','gross')

# Stack the three actor columns into one
actor <- rbind(actor1, actor2, actor3)

## Group the stacked actor rows
## by actor name.
actor_name <- group_by(actor, actor_name)

## Summarise per actor name
actor_grade <- summarise(actor_name,
                         count = n(),
                         actor_fb = mean(actor_facebook_likes, na.rm = T),
                         imdb_score = mean(imdb_score, na.rm = T),
                         movie_fb = sum(movie_facebook_likes, na.rm = T),
                         gross=sum(as.double(gross), na.rm=T))
(5-1)

After converting the result to a data frame, we removed the row for actors with no name.

actor_grade <- as.data.frame(actor_grade)
head(actor_grade,3)
##            actor_name count actor_fb imdb_score movie_fb    gross
## 1                        11        0   6.527273    35028  7121400
## 2 'Weird Al' Yankovic     1      445   7.000000     3843  6157157
## 3             50 Cent     2     1055   5.550000     3737 71058288
## Remove the row with no name
actor_grade <- actor_grade[-1,]
(5-2)

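Each of these values needs to be standardized (converted to a z-score).

## Actor grading inputs: number of films, actor Facebook likes, mean IMDb score, film Facebook likes, total gross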
movie_c <- numeric()
actor_f <- numeric()
actor_imdb <- numeric()
actor_m_fb <- numeric()
actor_gross <- numeric()

movie_c <- (actor_grade$count-mean(actor_grade$count))/sd(actor_grade$count)
actor_f <- (actor_grade$actor_fb-mean(actor_grade$actor_fb))/sd(actor_grade$actor_fb)
actor_imdb <- (actor_grade$imdb_score-mean(actor_grade$imdb_score))/sd(actor_grade$imdb_score)
actor_m_fb <- (actor_grade$movie_fb-mean(actor_grade$movie_fb))/sd(actor_grade$movie_fb)
actor_gross <- (actor_grade$gross-mean(actor_grade$gross))/sd(actor_grade$gross)
(5-3)

We then weighted the standardized values and assigned each actor a grade.

actor_grade$grade <- movie_c*0.3 + actor_f*0.3 + actor_imdb*0.1 + actor_m_fb*0.1 + actor_gross*0.2
actor_grade$grade2[actor_grade$grade > 2] <- 5
actor_grade$grade2[actor_grade$grade > 1 & actor_grade$grade <=2] <- 4
actor_grade$grade2[actor_grade$grade > 0 & actor_grade$grade <=1] <- 3
actor_grade$grade2[actor_grade$grade > -1 & actor_grade$grade <=0] <- 2
actor_grade$grade2[actor_grade$grade <=-1] <- 1

(6) Join

Next, we add the director and actor grades to the movie_USA data.

We use left_join, keyed on the director name and on each actor name.

## Director grade
grade_d <- director_grade[,c('director_name', 'grade2')]
names(grade_d) <- c('director_name', 'director_grade')

grade_a1 <- actor_grade[,c('actor_name','grade2')]
grade_a2 <- actor_grade[,c('actor_name','grade2')]
grade_a3 <- actor_grade[,c('actor_name','grade2')]

names(grade_a1) <- c('actor_1_name','actor_1_grade')
names(grade_a2) <- c('actor_2_name','actor_2_grade')
names(grade_a3) <- c('actor_3_name','actor_3_grade')

movie_USA <- movie_USA %>%
  left_join(grade_d, by = 'director_name') %>%
  left_join(grade_a1, by = 'actor_1_name') %>%
  left_join(grade_a2, by = 'actor_2_name') %>%
  left_join(grade_a3, by = 'actor_3_name')
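
What a left join does to unmatched names can be seen on a toy example. The sketch below is a base R stand-in that uses `match()` as a lookup rather than dplyr's `left_join`; the `movies` and `grades` frames and their values are made up for illustration.

```r
# Made-up film table and grade table; "Cy" has no grade
movies <- data.frame(
  movie_title   = c("M1", "M2", "M3"),
  director_name = c("Ann", "Bob", "Cy"),
  stringsAsFactors = FALSE
)
grades <- data.frame(
  director_name  = c("Ann", "Bob"),
  director_grade = c(4, 2),
  stringsAsFactors = FALSE
)

# Look up each film's director in the grade table; no match gives NA
idx <- match(movies$director_name, grades$director_name)
movies$director_grade <- grades$director_grade[idx]
print(movies)
```

Every film row is kept, and a director or actor missing from the grade table (for example, the nameless rows removed earlier) ends up with NA in the new column.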

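(7) Handling NAs

The variables created during preprocessing can contain NA values (for example, a film whose director or actor has no grade), and these must be handled.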

movie_USA[is.na(movie_USA)] <- 0

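2. Correlation analysis of numeric data / principal component analysis

Let's use correlation analysis to see which variables in the movie_USA dataset are related to gross.

For the correlation analysis, we first need to extract only the numeric columns of movie_USA.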

int_idx <- numeric()
cnt <- 1
for(k in seq_along(movie_USA)){
    if(class(movie_USA[[k]]) != 'character'){
        cat(' ',k)
        int_idx[cnt] <- k
        cnt <- cnt + 1
    }
}
##   2  4  5  10  12  14  16  18  19  20  21  22  23  24  25  27  29  30  31  32  33
# int_idx :: indices of the numeric columns
movie_int <- movie_USA[,int_idx]
movie_cor <- cor(movie_int, method="pearson")
library(corrplot)
corrplot(movie_cor, method="square")
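
The column-picking loop can be expressed more compactly. The sketch below is one possible alternative on a made-up data frame; it tests for `is.numeric` rather than "not character", which is a slightly stricter rule (it would also exclude logical or factor columns), so treat it as an assumption about what the loop is meant to select.

```r
# Made-up frame with one character and two numeric columns
toy <- data.frame(
  movie_title = c("A", "B"),
  duration    = c(120L, 95L),
  imdb_score  = c(7.1, 6.4),
  stringsAsFactors = FALSE
)

# Positions of the numeric (integer or double) columns
num_idx <- which(sapply(toy, is.numeric))
print(num_idx)

# Keep only those columns for cor()
toy_num <- toy[, num_idx]
```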

Next, let's compute the correlation between gross and each of the other numeric variables.

which(names(movie_int) == 'gross') # column 14
## [1] 14
cor(movie_int[[14]], movie_int[,c(-14)], method="pearson")
##      aspect_ratio movie_facebook_likes title_year  duration
## [1,]    0.1249332            0.4172304 0.02458376 0.1887188
##      director_facebook_likes actor_1_facebook_likes actor_2_facebook_likes
## [1,]               0.2014891              0.2384787              0.2113491
##      actor_3_facebook_likes total_cast_facebook_likes imdb_score
## [1,]                0.21392                 0.3426516  0.1965112
##      num_voted_users num_critic_for_reviews num_user_for_reviews
## [1,]       0.6336051              0.4904244            0.5786224
##      facenumber_in_poster     budget title_length director_grade
## [1,]           -0.0190535 0.04651558    0.1043408      0.4412508
##      actor_1_grade actor_2_grade actor_3_grade
## [1,]     0.3279043     0.3115112     0.3313477
The five variables most strongly correlated with gross, in descending order of correlation:

  • num_voted_users
  • num_user_for_reviews
  • num_critic_for_reviews
  • director_grade
  • movie_facebook_likes

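The number of faces on the poster, title length, aspect ratio, and release year all have correlations close to 0 (roughly -0.02 to 0.12). They show essentially no linear relationship with gross, so we do not consider them suitable predictors for the regression model.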
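
To see the ranking behind these statements, the printed correlations can be sorted by absolute value. The sketch below copies the coefficients from the output above into a named vector (`gross_cor` and `ranked` are illustrative names) and orders them.

```r
# Correlations with gross, copied from the printed cor() output
gross_cor <- c(
  aspect_ratio = 0.1249332, movie_facebook_likes = 0.4172304,
  title_year = 0.02458376, duration = 0.1887188,
  director_facebook_likes = 0.2014891, actor_1_facebook_likes = 0.2384787,
  actor_2_facebook_likes = 0.2113491, actor_3_facebook_likes = 0.21392,
  total_cast_facebook_likes = 0.3426516, imdb_score = 0.1965112,
  num_voted_users = 0.6336051, num_critic_for_reviews = 0.4904244,
  num_user_for_reviews = 0.5786224, facenumber_in_poster = -0.0190535,
  budget = 0.04651558, title_length = 0.1043408, director_grade = 0.4412508,
  actor_1_grade = 0.3279043, actor_2_grade = 0.3115112, actor_3_grade = 0.3313477
)

# Strongest first, judged by absolute value
ranked <- gross_cor[order(abs(gross_cor), decreasing = TRUE)]
print(head(ranked, 5))
print(tail(ranked, 5))
```

In this ordering, budget (about 0.047) also sits among the weakest correlations, even though the text keeps it for the principal component analysis.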

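Principal component analysis

Let's run a principal component analysis on the numeric columns, excluding the number of faces on the poster, title length, aspect ratio, and release year.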

movie_int <- movie_int[,c("movie_facebook_likes",'director_facebook_likes','duration',
                           'actor_1_facebook_likes','actor_2_facebook_likes', 'actor_3_facebook_likes',
                           'total_cast_facebook_likes', 'imdb_score', 'num_voted_users','num_critic_for_reviews',
                           'num_user_for_reviews','director_grade','actor_1_grade','actor_2_grade','actor_3_grade', 'budget', 'gross')]
movie_cor <- cor(movie_int, method="pearson")
## What share of the variance is explained by the largest principal component?
max(eigen(movie_cor)$values) / sum(eigen(movie_cor)$values)
## [1] 0.3295414
## princomp
movie_prin <- princomp(movie_int, cor=T, scores = T)

## Standard deviations of the principal components
movie_prin$sdev
##    Comp.1    Comp.2    Comp.3    Comp.4    Comp.5    Comp.6    Comp.7 
## 2.3668974 1.4218331 1.0977229 1.0523928 1.0350248 0.9967623 0.9369996 
##    Comp.8    Comp.9   Comp.10   Comp.11   Comp.12   Comp.13   Comp.14 
## 0.8864675 0.8249538 0.7425463 0.7026020 0.6024991 0.5575142 0.5376099 
##   Comp.15   Comp.16   Comp.17 
## 0.5064587 0.4846246 0.3940436
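
The standard deviations above are enough to recover the share of variance explained by each component, since the variance of a component is its standard deviation squared. The sketch below copies the printed values into a vector (`sdev`, `eig`, `prop`, and `n_kaiser` are illustrative names) and also counts the components whose eigenvalue exceeds 1, one common rule of thumb for choosing how many to keep.

```r
# Component standard deviations, copied from the princomp output
sdev <- c(2.3668974, 1.4218331, 1.0977229, 1.0523928, 1.0350248, 0.9967623,
          0.9369996, 0.8864675, 0.8249538, 0.7425463, 0.7026020, 0.6024991,
          0.5575142, 0.5376099, 0.5064587, 0.4846246, 0.3940436)

eig  <- sdev^2            # eigenvalues of the correlation matrix
prop <- eig / sum(eig)    # proportion of variance per component

print(round(prop[1], 4))          # share of the first component
print(round(cumsum(prop), 3))     # cumulative share

# Components with eigenvalue above 1
n_kaiser <- sum(eig > 1)
print(n_kaiser)
```

The first share reproduces the 0.3295 computed from `eigen(movie_cor)`. Note that the "Proportion Var" row printed with the loadings is a different quantity: each loading vector has unit length, so that row is 1/17 (about 0.059) for every component and says nothing about variance explained.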
## Principal component coefficients (loadings matrix)
movie_prin$loadings
## 
## Loadings:
##                           Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## movie_facebook_likes      -0.267  0.193 -0.302  0.145 -0.167 -0.132       
## director_facebook_likes   -0.188  0.214  0.380         0.197  0.153 -0.544
## duration                  -0.145  0.128  0.270 -0.112  0.357         0.701
## actor_1_facebook_likes    -0.225 -0.223  0.327 -0.255 -0.456 -0.130       
## actor_2_facebook_likes    -0.202 -0.297         0.624  0.155              
## actor_3_facebook_likes    -0.186 -0.324 -0.341 -0.383  0.364              
## total_cast_facebook_likes -0.265 -0.289               -0.120              
## imdb_score                -0.188  0.280  0.214         0.241         0.264
## num_voted_users           -0.335  0.260 -0.118                            
## num_critic_for_reviews    -0.310  0.183 -0.268        -0.179              
## num_user_for_reviews      -0.298  0.274 -0.166                            
## director_grade            -0.277  0.174  0.321         0.130  0.137 -0.289
## actor_1_grade             -0.250 -0.259  0.291 -0.171 -0.341         0.107
## actor_2_grade             -0.245 -0.319  0.104  0.473  0.131              
## actor_3_grade             -0.243 -0.328 -0.231 -0.296  0.308              
## budget                                  -0.103        -0.282  0.941  0.142
## gross                     -0.291  0.114 -0.203        -0.101              
##                           Comp.8 Comp.9 Comp.10 Comp.11 Comp.12 Comp.13
## movie_facebook_likes       0.420 -0.342 -0.200           0.268         
## director_facebook_likes          -0.399         -0.262   0.305  -0.188 
## duration                  -0.249 -0.432                                
## actor_1_facebook_likes                   0.105  -0.195   0.158   0.558 
## actor_2_facebook_likes                   0.119                   0.309 
## actor_3_facebook_likes     0.135                                 0.272 
## total_cast_facebook_likes        -0.179  0.720   0.321  -0.109  -0.386 
## imdb_score                 0.491  0.585  0.164           0.170  -0.102 
## num_voted_users           -0.155  0.170  0.187  -0.305   0.123         
## num_critic_for_reviews     0.268 -0.193 -0.185          -0.375         
## num_user_for_reviews      -0.363         0.172  -0.411  -0.405         
## director_grade                          -0.147   0.523  -0.413   0.308 
## actor_1_grade                     0.151 -0.419                  -0.418 
## actor_2_grade                           -0.227                  -0.149 
## actor_3_grade                     0.119 -0.183                  -0.145 
## budget                                                                 
## gross                     -0.495  0.179          0.451   0.518         
##                           Comp.14 Comp.15 Comp.16 Comp.17
## movie_facebook_likes              -0.204   0.438   0.280 
## director_facebook_likes                   -0.191         
## duration                                                 
## actor_1_facebook_likes    -0.279   0.139                 
## actor_2_facebook_likes     0.209  -0.438  -0.304         
## actor_3_facebook_likes     0.527   0.262                 
## total_cast_facebook_likes                                
## imdb_score                         0.109  -0.136   0.174 
## num_voted_users                   -0.185   0.285  -0.696 
## num_critic_for_reviews             0.271  -0.560  -0.273 
## num_user_for_reviews                               0.547 
## director_grade                    -0.139   0.266         
## actor_1_grade              0.423  -0.218                 
## actor_2_grade             -0.212   0.573   0.332         
## actor_3_grade             -0.606  -0.369  -0.116         
## budget                                                   
## gross                              0.138  -0.240   0.118 
## 
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.059  0.059  0.059  0.059  0.059  0.059  0.059  0.059
## Cumulative Var  0.059  0.118  0.176  0.235  0.294  0.353  0.412  0.471
##                Comp.9 Comp.10 Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
## SS loadings     1.000   1.000   1.000   1.000   1.000   1.000   1.000
## Proportion Var  0.059   0.059   0.059   0.059   0.059   0.059   0.059
## Cumulative Var  0.529   0.588   0.647   0.706   0.765   0.824   0.882
##                Comp.16 Comp.17
## SS loadings      1.000   1.000
## Proportion Var   0.059   0.059
## Cumulative Var   0.941   1.000
## Determining the number of principal components
library(knitr)
library(rgl)
knit_hooks$set(webgl = hook_webgl)
### Assign as many colors as there are principal components
color_custum = c("#FF0000", "#FF5E00", "#FFBB00", "#FFE400", "#ABF200", "#1DDB16", "#00D8FF", "#0054FF", "#0100FF","#5F00FF", "#FF00DD", "#FF007F", "#CC3D3D", "#CC723D", "#CCA63D", "#C4B73B", "#9FC93C")

### Build the coordinates (3D box)
coords <- NULL
color_custum2 = rep('black',34)

# One segment per variable: from the origin to its scaled loadings on Comp.1 to Comp.3
for (i in 1:nrow(movie_prin$loadings)) {
    coords <- rbind(coords, rbind(c(0,0,0),movie_prin$loadings[i,1:3]*20))
    color_custum2[i*2] = color_custum[i]
}


Grouping the variables by similar loadings gives:

  • duration / director_grade / director_facebook_likes / imdb_score
  • num_voted_users / num_user_for_reviews / movie_facebook_likes / num_critic_for_reviews / gross
  • actor_1_grade / actor_1_facebook_likes
  • actor_2_grade / actor_2_facebook_likes / total_cast_facebook_likes
  • actor_3_grade / actor_3_facebook_likes

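3. Selecting nominal variables

Here we select the nominal variables in the movie_USA dataset that may affect gross.

The candidates are country + language, genre, and content rating.

(1) Country + language

Country and language are closely related, and most of the data is concentrated in English and in the USA & UK, so we merge the two into a single variable, language, with two categories: English and Other.

However, more than 90% of the films are in English, so splitting the data on this variable would only reproduce the overall picture. We therefore do not use it as a nominal variable for partitioning the movie_USA dataset.

(2) Genre

To see how genre affects gross, the genres column has to be preprocessed separately. The column holds several values separated by '|', so to merge similar genres we build a term dictionary from the genres and run a principal component analysis on them (24 genre columns appear in the output below).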

# Get the genres
genres_USA <- movie_USA$genres
head(genres_USA)
## [1] "Action|Adventure|Fantasy|Sci-Fi" "Action|Adventure|Thriller"      
## [3] "Action|Thriller"                 "Action|Adventure|Fantasy|Sci-Fi"
## [5] "Action|Adventure|Fantasy"        "Action|Adventure|Sci-Fi"
(2-1)

Genre preprocessing

# Replace | with a space
genres_df <- str_replace_all(genres_USA, '[|]',' ')
# Sci-Fi -> ScienceFiction
genres_df <- str_replace_all(genres_df, 'Sci-Fi', 'ScienceFiction')
# Film-Noir -> FilmNoir
genres_df <- str_replace_all(genres_df, 'Film-Noir', 'FilmNoir')
# Reality-TV -> RealityTV
genres_df <- str_replace_all(genres_df, 'Reality-TV', 'RealityTV')
# Talk-Show -> TalkShow
genres_df <- str_replace_all(genres_df, 'Talk-Show', 'TalkShow')
# Game-Show -> GameShow
genres_df <- str_replace_all(genres_df, 'Game-Show', 'GameShow')

# Convert to data.frame
genres_df <- data.frame(genres_df)
names(genres_df) <- c('genres')
head(genres_df)
##                                    genres
## 1 Action Adventure Fantasy ScienceFiction
## 2               Action Adventure Thriller
## 3                         Action Thriller
## 4 Action Adventure Fantasy ScienceFiction
## 5                Action Adventure Fantasy
## 6         Action Adventure ScienceFiction
(2-2)

Build a Corpus object from the preprocessed genres.

# Build the term dictionary
#install.packages("tm")
library(tm)
genres_corpus <- Corpus(VectorSource(genres_df$genres))
(2-3)

Create a DTM from the Corpus object

## Inspect the corpus
# genres_corpus
# str(genres_corpus)
# inspect(head(genres_corpus))

# Document-term count table
genres_dtm <- DocumentTermMatrix(genres_corpus)
(2-4)

Convert DTM object -> Matrix -> DataFrame

# Convert to data.frame
genres_matrix <- as.matrix(genres_dtm)
genres_result <- as.data.frame(genres_matrix)
head(genres_result)
##   action adventure fantasy sciencefiction thriller western animation
## 1      1         1       1              1        0       0         0
## 2      1         1       0              0        1       0         0
## 3      1         0       0              0        1       0         0
## 4      1         1       1              1        0       0         0
## 5      1         1       1              0        0       0         0
## 6      1         1       0              1        0       0         0
##   comedy family musical romance mystery drama history sport crime horror
## 1      0      0       0       0       0     0       0     0     0      0
## 2      0      0       0       0       0     0       0     0     0      0
## 3      0      0       0       0       0     0       0     0     0      0
## 4      0      0       0       0       0     0       0     0     0      0
## 5      0      0       0       0       0     0       0     0     0      0
## 6      0      0       0       0       0     0       0     0     0      0
##   biography war music documentary news short filmnoir
## 1         0   0     0           0    0     0        0
## 2         0   0     0           0    0     0        0
## 3         0   0     0           0    0     0        0
## 4         0   0     0           0    0     0        0
## 5         0   0     0           0    0     0        0
## 6         0   0     0           0    0     0        0
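
The document-term matrix here is effectively a one-hot encoding of the genres. As an illustration only, the sketch below builds the same kind of 0/1 table in base R by splitting on '|' directly, without the tm package; the three sample strings are the first three printed genre values, and `parts`, `levels_all`, and `genres_onehot` are hypothetical names. Unlike the tm route, it keeps the original spelling and case of each genre.

```r
# First three genre strings from the printed head(genres_USA)
genres <- c("Action|Adventure|Fantasy|Sci-Fi",
            "Action|Adventure|Thriller",
            "Action|Thriller")

# Split each string on the literal "|" character
parts <- strsplit(genres, "|", fixed = TRUE)

# All distinct genres, in order of first appearance
levels_all <- unique(unlist(parts))

# One row per film, one 0/1 column per genre
genres_onehot <- t(sapply(parts, function(p) as.integer(levels_all %in% p)))
colnames(genres_onehot) <- levels_all
print(genres_onehot)
```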
(2-5)

Principal component analysis on the expanded genres

################### Principal component analysis #############################
pc <- prcomp(genres_result)
plot(pc)

(2-6)

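Check the eigenvalues and eigenvectors

en <- eigen(cor(genres_result)) # $values: eigenvalues, $vectors: eigenvectors
# View $values: the eigenvalues (scalars)
en$values # $values: a component whose eigenvalue is 1 or more is treated as a factor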
##  [1] 2.6484115 2.3915624 1.6576025 1.3041511 1.2748943 1.2124926 1.1183045
##  [8] 1.0431824 1.0123646 0.9963950 0.9869085 0.9615677 0.8793074 0.8337590
## [15] 0.8185040 0.7510886 0.7165941 0.6303705 0.5816751 0.5688572 0.4443235
## [22] 0.4250463 0.4043951 0.3382421
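
Applying the "eigenvalue of 1 or more" rule to the printed values is a one-line count. The sketch below copies the eigenvalues from the output above (`values`, `n_factors`, and `cum_share` are illustrative names) and also reports how much of the total variance the retained components cover.

```r
# Eigenvalues of cor(genres_result), copied from the printed output
values <- c(2.6484115, 2.3915624, 1.6576025, 1.3041511, 1.2748943, 1.2124926,
            1.1183045, 1.0431824, 1.0123646, 0.9963950, 0.9869085, 0.9615677,
            0.8793074, 0.8337590, 0.8185040, 0.7510886, 0.7165941, 0.6303705,
            0.5816751, 0.5688572, 0.4443235, 0.4250463, 0.4043951, 0.3382421)

# Number of components with eigenvalue >= 1
n_factors <- sum(values >= 1)
print(n_factors)

# Cumulative share of variance covered by the retained components
cum_share <- cumsum(values) / sum(values)
print(round(cum_share[n_factors], 3))
```

Several eigenvalues sit very close to the cutoff (1.012 just above, 0.996 just below), so the count is sensitive to the threshold and is best read as a rough guide.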
en$vectors
##               [,1]          [,2]         [,3]         [,4]         [,5]
##  [1,]  0.056675404 -0.3847525822 -0.253171562  0.310793679  0.266089670
##  [2,]  0.360674036 -0.1840566293 -0.319554995  0.076272218  0.113512056
##  [3,]  0.348191536 -0.0491530129 -0.069603577 -0.192798563 -0.066154644
##  [4,]  0.134427094 -0.3038222463 -0.126493229  0.269426527 -0.104266801
##  [5,] -0.212557389 -0.4547952273  0.021804763 -0.200833683  0.102731972
##  [6,] -0.018629634  0.0188288740 -0.119960236  0.080066438  0.185865796
##  [7,]  0.408387946  0.0275464267 -0.111579311 -0.317892023  0.060943832
##  [8,]  0.268258528  0.2465346811  0.325803391  0.203789760  0.136118959
##  [9,]  0.451603443  0.0737652344 -0.070300821 -0.280285041  0.035085105
## [10,]  0.156827633  0.1210666868  0.024248560 -0.328152152  0.027525433
## [11,] -0.028862056  0.3032392002  0.190491985  0.118223978  0.204434805
## [12,] -0.120252982 -0.2439345189  0.138838855 -0.437573867 -0.121010064
## [13,] -0.310956202  0.2567319029 -0.116256032 -0.248855288  0.203455136
## [14,] -0.143005220  0.1433898395 -0.496496484 -0.072190049 -0.064542096
## [15,] -0.023248371  0.1381821051 -0.079334133  0.112062949 -0.149054982
## [16,] -0.198049432 -0.1864057529  0.128768881 -0.169670412  0.439248232
## [17,] -0.051047179 -0.2430753075  0.150442842 -0.150070447 -0.532390087
## [18,] -0.158747538  0.2036775357 -0.361930147 -0.125745208 -0.109554141
## [19,] -0.113442318  0.1013023670 -0.420344745 -0.043926275  0.007203942
## [20,] -0.009372221  0.1656873109  0.044796440 -0.025527205 -0.125914143
## [21,] -0.026101039  0.0615470666 -0.050586887  0.194536719 -0.417793283
## [22,] -0.007903276  0.0006151687  0.007127273  0.066962267 -0.126327068
## [23,]  0.021537153  0.0186316896  0.008921198  0.007355877 -0.102511439
## [24,] -0.020229436 -0.0185273969  0.021780115 -0.122567948  0.047044450
##              [,6]         [,7]        [,8]          [,9]         [,10]
##  [1,]  0.07781567  0.034658981  0.14014282  8.248704e-05  0.0086993670
##  [2,] -0.03034451  0.039362180  0.05395627  9.280389e-03  0.0625714545
##  [3,] -0.09952218  0.001629058 -0.10650463 -6.030317e-02 -0.0276565285
##  [4,] -0.18247803 -0.021696353  0.29064486 -1.828103e-02  0.1314746755
##  [5,]  0.05745830  0.010836088  0.05600029  1.342772e-02 -0.0556670828
##  [6,] -0.04538338  0.145585167 -0.16820658  1.524072e-01 -0.0170270106
##  [7,]  0.11205393 -0.043280172 -0.05383395 -4.533747e-02 -0.0326484150
##  [8,]  0.01053396 -0.042617038 -0.17494189 -1.107944e-02 -0.0141811350
##  [9,]  0.11035235 -0.110000592 -0.04342740  1.013120e-02 -0.0270923654
## [10,]  0.06397267  0.250905541  0.32268215  1.305488e-01 -0.0754355168
## [11,] -0.31445832  0.180129713  0.03406623  7.816309e-02  0.0615974097
## [12,] -0.04799016 -0.003824692 -0.07433281  3.666218e-02  0.0493912668
## [13,] -0.11411843 -0.060962036  0.10978837 -2.843498e-03  0.0204407032
## [14,] -0.06339427  0.099938263 -0.14835334 -1.576068e-02 -0.0227825699
## [15,]  0.17246903 -0.686509696  0.06105398  2.188348e-01  0.0000121578
## [16,]  0.43780160 -0.032132785 -0.07669407 -2.539939e-02 -0.1168218174
## [17,] -0.30783258  0.002862913 -0.08232360  1.298184e-02 -0.0673933914
## [18,]  0.12124473 -0.262848485  0.22313723 -2.786871e-02 -0.0243170356
## [19,] -0.16439745  0.266538286 -0.29768832 -1.975017e-03 -0.0069133276
## [20,]  0.19975192  0.296782413  0.55993276 -4.041951e-01  0.1117956868
## [21,]  0.50372040  0.204843396 -0.10364039 -1.074309e-02  0.0481285498
## [22,]  0.33310867  0.171740403 -0.39692167 -1.982032e-01 -0.0271372586
## [23,]  0.17526117  0.279089761  0.14718859  8.286173e-01 -0.0430823372
## [24,]  0.06385421 -0.025696602 -0.10802029  8.147324e-02  0.9609746635
##              [,11]        [,12]       [,13]        [,14]       [,15]
##  [1,]  0.061949520  0.030023412 -0.11085068 -0.075468848  0.16329478
##  [2,] -0.043602978  0.092260064  0.17909007 -0.047350020 -0.02285845
##  [3,] -0.005590118  0.029774340  0.12341107 -0.606161540  0.28400773
##  [4,]  0.016091227  0.252933185 -0.00581866  0.339802078 -0.22621303
##  [5,]  0.048211074  0.032042025 -0.01129328  0.001512408  0.06108903
##  [6,] -0.890673468 -0.192276064 -0.06443147  0.063099342 -0.05228645
##  [7,] -0.013271112 -0.024019160  0.06640033  0.200438452 -0.14422332
##  [8,]  0.182025288 -0.203432863 -0.21465422  0.151386212 -0.27873114
##  [9,] -0.006450087 -0.007060929  0.09245646  0.074194335 -0.09772476
## [10,] -0.037140554  0.178945383 -0.53656296  0.284363008  0.44526945
## [11,]  0.012744777  0.296032486  0.11812605  0.003567509  0.13830072
## [12,] -0.070001031  0.057199718  0.34601476  0.347447354 -0.16006526
## [13,] -0.042034272  0.263020427  0.28668543 -0.060874546  0.04311474
## [14,]  0.157214514 -0.133695340 -0.20415359  0.033246913 -0.20080831
## [15,] -0.103509601  0.125275383  0.02180079  0.150773972  0.25418014
## [16,]  0.094509042 -0.142731694 -0.13412228 -0.072267744 -0.04621726
## [17,] -0.091101511 -0.093826829 -0.32008466 -0.148660555 -0.03326146
## [18,] -0.047637407  0.024772616 -0.24256821 -0.189803459 -0.28955925
## [19,]  0.248562738 -0.118889164  0.05177526  0.244752599  0.16521033
## [20,] -0.119134612 -0.128001729  0.13093684 -0.100920086 -0.17769945
## [21,] -0.015927828 -0.165690953  0.26120845  0.155436138  0.34138962
## [22,] -0.102188453  0.726385135 -0.16798740 -0.072728659 -0.19431855
## [23,]  0.122125105  0.054997519  0.10340670 -0.198868654 -0.27474793
## [24,]  0.016398306 -0.040221028 -0.15776516 -0.070777056  0.03029140
##              [,16]        [,17]        [,18]        [,19]        [,20]
##  [1,]  0.108809610 -0.250117300  0.046743171 -0.023612209  0.011918781
##  [2,] -0.062410693 -0.087280219 -0.111530991 -0.548513147 -0.014101545
##  [3,] -0.048313612 -0.143286710 -0.197533205  0.487259396  0.040023458
##  [4,] -0.150388258  0.224479326 -0.003332828  0.548065370  0.055421617
##  [5,]  0.050395589 -0.069924140  0.296052474 -0.014971998 -0.020214245
##  [6,]  0.033293591 -0.013174062  0.001990173  0.133040748  0.027305620
##  [7,]  0.004025411  0.155961181  0.474482226  0.004799879  0.036407392
##  [8,]  0.060558797 -0.150801784 -0.177426888  0.093461154  0.062656610
##  [9,]  0.055925884  0.036902581  0.159941736  0.045797825 -0.050301170
## [10,] -0.063291410 -0.006057541 -0.217398583 -0.025450180 -0.032408714
## [11,] -0.236418503 -0.441884731  0.514675496  0.049256259  0.108181972
## [12,] -0.095542026 -0.487001930 -0.338724964  0.029093158  0.066065725
## [13,]  0.038035155  0.410117965 -0.044535876  0.035268504 -0.140926615
## [14,] -0.191663252 -0.258448617  0.053620383  0.142078596 -0.640984146
## [15,]  0.366622756 -0.255898298  0.087590685  0.101525625 -0.135209411
## [16,] -0.046494252 -0.011483533  0.125456346  0.241542305  0.060206007
## [17,]  0.136319010  0.036947616  0.311843310 -0.096766971  0.018928742
## [18,] -0.316816215 -0.143448373 -0.005671683 -0.051340081  0.564200632
## [19,]  0.472542645 -0.006565593  0.003991722  0.128362042  0.433431060
## [20,]  0.421172826 -0.213176507  0.071182401  0.066799225 -0.082766340
## [21,] -0.368024858  0.047755125  0.130685688  0.052821795  0.051766147
## [22,]  0.165815725 -0.039554154 -0.032917605 -0.019994064 -0.014608592
## [23,]  0.148879010  0.011827033  0.006266205  0.041641265 -0.003768174
## [24,]  0.018200636  0.030312622  0.027908353 -0.011654078  0.001861088
##              [,21]        [,22]        [,23]        [,24]
##  [1,] -0.328550936 -0.047653782  0.546054903  0.235001796
##  [2,]  0.410258240 -0.157572159 -0.298304303  0.231175238
##  [3,] -0.048002386 -0.116849817 -0.143637574  0.050635763
##  [4,]  0.187767531 -0.009881887 -0.094917164  0.050598590
##  [5,] -0.340612878  0.296167875 -0.593127761  0.168310498
##  [6,] -0.028802362  0.030906057 -0.058262863  0.053276057
##  [7,] -0.302004890 -0.515291475  0.028807493 -0.142373334
##  [8,] -0.181870524 -0.072936128 -0.201602857  0.545346183
##  [9,]  0.139012057  0.717997446  0.267784556  0.118040018
## [10,]  0.014070824 -0.032461989 -0.043120547  0.035684565
## [11,]  0.142510064  0.024751927  0.003578679  0.027536752
## [12,] -0.033683154 -0.108486693  0.180342730  0.055740215
## [13,] -0.070135311 -0.104496915  0.118318318  0.556838569
## [14,]  0.021865077 -0.014191548 -0.047384108  0.015624514
## [15,]  0.103496568 -0.101695694 -0.119993308  0.028379193
## [16,]  0.552019450 -0.168548537  0.061629526  0.057090094
## [17,]  0.238930208 -0.121577390  0.194431870  0.352287158
## [18,] -0.090718595  0.055344083 -0.019896808  0.065865207
## [19,]  0.111790260  0.029347017 -0.067699306  0.004991189
## [20,]  0.056689979 -0.027647191 -0.055556786  0.028480757
## [21,] -0.019603975  0.003882180 -0.003780037  0.272723818
## [22,] -0.033877456  0.007142376 -0.007810703 -0.007575834
## [23,] -0.024440339 -0.035185425  0.004969585 -0.013865102
## [24,] -0.009659855  0.005613982  0.003764460 -0.003174772
# - An eigenvalue is a characteristic real number derived from a matrix (here, the correlation matrix)

# Visualization: scree plot of the eigenvalues
plot(en$values, type="o")

cor(genres_result)
##                      action    adventure      fantasy sciencefiction
## action          1.000000000  0.327083047  0.056734790    0.305544268
## adventure       0.327083047  1.000000000  0.288423964    0.250880162
## fantasy         0.056734790  0.288423964  1.000000000    0.026992043
## sciencefiction  0.305544268  0.250880162  0.026992043    1.000000000
## thriller        0.309037976 -0.034625218 -0.109116726    0.133198401
## western         0.036122944  0.041243398 -0.035790861   -0.036292183
## animation      -0.028418375  0.326163249  0.283704815    0.062310081
## comedy         -0.188745545 -0.030986507  0.057661582   -0.102162100
## family         -0.064688179  0.337307448  0.358433947    0.029564655
## musical        -0.081043515  0.031548813  0.092134955   -0.046017208
## romance        -0.190352160 -0.141036424 -0.049335258   -0.141257663
## mystery        -0.044073842 -0.056711597 -0.038812166    0.036923331
## drama          -0.266604333 -0.273884121 -0.199178606   -0.219862530
## history        -0.009663113 -0.011701320 -0.076529094   -0.077212175
## sport          -0.045424653 -0.070487239 -0.058869236   -0.056033693
## crime           0.148951512 -0.166666514 -0.151870378   -0.131271582
## horror         -0.048599227 -0.103119031  0.045738906    0.099014039
## biography      -0.101995382 -0.080849217 -0.095287677   -0.099124989
## war             0.025058975  0.001623951 -0.058869236   -0.077919581
## music          -0.098196729 -0.075220374 -0.040862802   -0.070099462
## documentary    -0.068374049 -0.058436715 -0.050774045   -0.051305077
## news           -0.008629398 -0.007708442 -0.005749809   -0.005801130
## short          -0.012205269  0.016465148 -0.008132429   -0.008205017
## filmnoir       -0.008629398 -0.007708442 -0.005749809   -0.005801130
##                    thriller      western     animation       comedy
## action          0.309037976  0.036122944 -0.0284183755 -0.188745545
## adventure      -0.034625218  0.041243398  0.3261632490 -0.030986507
## fantasy        -0.109116726 -0.035790861  0.2837048147  0.057661582
## sciencefiction  0.133198401 -0.036292183  0.0623100810 -0.102162100
## thriller        1.000000000 -0.043284064 -0.1300563645 -0.388574871
## western        -0.043284064  1.000000000 -0.0026082912 -0.055142951
## animation      -0.130056364 -0.002608291  1.0000000000  0.178075696
## comedy         -0.388574871 -0.055142951  0.1780756958  1.000000000
## family         -0.217636220 -0.022965532  0.5507282264  0.235060095
## musical        -0.101441993 -0.010072902  0.1825356934  0.066456474
## romance        -0.225668763  0.014607453 -0.0739962721  0.176949329
## mystery         0.322166606 -0.029839787 -0.0353793012 -0.196482911
## drama          -0.040004922  0.019915000 -0.1655623321 -0.251384161
## history        -0.069386572  0.041271702 -0.0412012799 -0.147129430
## sport          -0.121446769 -0.016910976 -0.0194833899  0.006217438
## crime           0.354794263  0.002455341 -0.0890845606 -0.088751375
## horror          0.224019383 -0.042637299 -0.0723849532 -0.171285946
## biography      -0.103894156  0.012577854 -0.0511313850 -0.153662123
## war            -0.041948470  0.030905242 -0.0250317103 -0.126638646
## music          -0.121355600 -0.018156283 -0.0005479914  0.023462531
## documentary    -0.095336593 -0.019114184 -0.0265633525 -0.073089737
## news           -0.009830895 -0.001971012 -0.0035159113 -0.012358992
## short          -0.013904646 -0.002787764 -0.0049728435  0.004929392
## filmnoir        0.024300081 -0.001971012 -0.0035159113 -0.012358992
##                      family      musical       romance      mystery
## action         -0.064688179 -0.081043515 -0.1903521596 -0.044073842
## adventure       0.337307448  0.031548813 -0.1410364245 -0.056711597
## fantasy         0.358433947  0.092134955 -0.0493352581 -0.038812166
## sciencefiction  0.029564655 -0.046017208 -0.1412576629  0.036923331
## thriller       -0.217636220 -0.101441993 -0.2256687626  0.322166606
## western        -0.022965532 -0.010072902  0.0146074525 -0.029839787
## animation       0.550728226  0.182535693 -0.0739962721 -0.035379301
## comedy          0.235060095  0.066456474  0.1769493294 -0.196482911
## family          1.000000000  0.194340692 -0.0609257393 -0.055151947
## musical         0.194340692  1.000000000  0.0700751487 -0.037223196
## romance        -0.060925739  0.070075149  1.0000000000 -0.104118187
## mystery        -0.055151947 -0.037223196 -0.1041181865  1.000000000
## drama          -0.200012464 -0.013694147  0.1739720921 -0.002318738
## history        -0.071398856 -0.020194641 -0.0141335951 -0.064567743
## sport           0.043301279 -0.035198981 -0.0188413294 -0.069228661
## crime          -0.138791554 -0.051427812 -0.1381618838  0.126354445
## horror         -0.109824800 -0.037413839 -0.1590965288  0.179612307
## biography      -0.063465344  0.013919500 -0.0208960231 -0.081621568
## war            -0.068395480 -0.020636644  0.0009699831 -0.045235912
## music           0.007994513  0.083423587  0.0480286452 -0.056205774
## documentary    -0.044647549 -0.025413395 -0.0830780015 -0.044553566
## news           -0.005646449 -0.002620572 -0.0085668166 -0.005154089
## short           0.025926927  0.062614361  0.0136621344 -0.007289854
## filmnoir       -0.005646449 -0.002620572 -0.0085668166  0.046349902
##                       drama      history        sport        crime
## action         -0.266604333 -0.009663113 -0.045424653  0.148951512
## adventure      -0.273884121 -0.011701320 -0.070487239 -0.166666514
## fantasy        -0.199178606 -0.076529094 -0.058869236 -0.151870378
## sciencefiction -0.219862530 -0.077212175 -0.056033693 -0.131271582
## thriller       -0.040004922 -0.069386572 -0.121446769  0.354794263
## western         0.019915000  0.041271702 -0.016910976  0.002455341
## animation      -0.165562332 -0.041201280 -0.019483390 -0.089084561
## comedy         -0.251384161 -0.147129430  0.006217438 -0.088751375
## family         -0.200012464 -0.071398856  0.043301279 -0.138791554
## musical        -0.013694147 -0.020194641 -0.035198981 -0.051427812
## romance         0.173972092 -0.014133595 -0.018841329 -0.138161884
## mystery        -0.002318738 -0.064567743 -0.069228661  0.126354445
## drama           1.000000000  0.164141827  0.062962887  0.047454746
## history         0.164141827  1.000000000  0.005933504 -0.058123908
## sport           0.062962887  0.005933504  1.000000000 -0.080927771
## crime           0.047454746 -0.058123908 -0.080927771  1.000000000
## horror         -0.212188985 -0.064758050 -0.069412112 -0.109134785
## biography       0.225212655  0.313287698  0.168764184 -0.017192290
## war             0.161435506  0.328182608 -0.031040203 -0.080927771
## music           0.058121795 -0.008899522 -0.044635632 -0.058202747
## documentary    -0.119823622  0.043458472  0.050742481 -0.038314627
## news           -0.016054638 -0.003179611 -0.003208742  0.032277300
## short          -0.022707399 -0.004497186 -0.004538388 -0.010468163
## filmnoir        0.014879908 -0.003179611 -0.003208742  0.032277300
##                      horror    biography           war         music
## action         -0.048599227 -0.101995382  0.0250589752 -0.0981967288
## adventure      -0.103119031 -0.080849217  0.0016239514 -0.0752203739
## fantasy         0.045738906 -0.095287677 -0.0588692359 -0.0408628019
## sciencefiction  0.099014039 -0.099124989 -0.0779195807 -0.0700994624
## thriller        0.224019383 -0.103894156 -0.0419484697 -0.1213555996
## western        -0.042637299  0.012577854  0.0309052420 -0.0181562825
## animation      -0.072384953 -0.051131385 -0.0250317103 -0.0005479914
## comedy         -0.171285946 -0.153662123 -0.1266386460  0.0234625308
## family         -0.109824800 -0.063465344 -0.0683954803  0.0079945135
## musical        -0.037413839  0.013919500 -0.0206366440  0.0834235867
## romance        -0.159096529 -0.020896023  0.0009699831  0.0480286452
## mystery         0.179612307 -0.081621568 -0.0452359117 -0.0562057738
## drama          -0.212188985  0.225212655  0.1614355058  0.0581217946
## history        -0.064758050  0.313287698  0.3281826080 -0.0088995221
## sport          -0.069412112  0.168764184 -0.0310402032 -0.0446356321
## crime          -0.109134785 -0.017192290 -0.0809277705 -0.0582027472
## horror          1.000000000 -0.085085413 -0.0654217707 -0.0718866445
## biography      -0.085085413  1.000000000  0.1007142781  0.0844408721
## war            -0.065421771  0.100714278  1.0000000000 -0.0271174889
## music          -0.071886644  0.084440872 -0.0271174889  1.0000000000
## documentary    -0.050115013  0.026406200  0.0261845592  0.0946196764
## news           -0.005167747 -0.004081984 -0.0032087417 -0.0033231329
## short          -0.007309172 -0.005773487 -0.0045383881 -0.0047001811
## filmnoir       -0.005167747 -0.004081984 -0.0032087417 -0.0033231329
##                 documentary          news        short      filmnoir
## action         -0.068374049 -0.0086293977 -0.012205269 -0.0086293977
## adventure      -0.058436715 -0.0077084425  0.016465148 -0.0077084425
## fantasy        -0.050774045 -0.0057498090 -0.008132429 -0.0057498090
## sciencefiction -0.051305077 -0.0058011304 -0.008205017 -0.0058011304
## thriller       -0.095336593 -0.0098308949 -0.013904646  0.0243000810
## western        -0.019114184 -0.0019710116 -0.002787764 -0.0019710116
## animation      -0.026563353 -0.0035159113 -0.004972843 -0.0035159113
## comedy         -0.073089737 -0.0123589915  0.004929392 -0.0123589915
## family         -0.044647549 -0.0056464489  0.025926927 -0.0056464489
## musical        -0.025413395 -0.0026205721  0.062614361 -0.0026205721
## romance        -0.083078001 -0.0085668166  0.013662134 -0.0085668166
## mystery        -0.044553566 -0.0051540895 -0.007289854  0.0463499024
## drama          -0.119823622 -0.0160546380 -0.022707399  0.0148799084
## history         0.043458472 -0.0031796106 -0.004497186 -0.0031796106
## sport           0.050742481 -0.0032087417 -0.004538388 -0.0032087417
## crime          -0.038314627  0.0322772996 -0.010468163  0.0322772996
## horror         -0.050115013 -0.0051677474 -0.007309172 -0.0051677474
## biography       0.026406200 -0.0040819840 -0.005773487 -0.0040819840
## war             0.026184559 -0.0032087417 -0.004538388 -0.0032087417
## music           0.094619676 -0.0033231329 -0.004700181 -0.0033231329
## documentary     1.000000000  0.1031177498  0.071285630 -0.0023166869
## news            0.103117750  1.0000000000 -0.000337884 -0.0002388915
## short           0.071285630 -0.0003378840  1.000000000 -0.0003378840
## filmnoir       -0.002316687 -0.0002388915 -0.000337884  1.0000000000
(2-7)

Factor analysis using the number of factors suggested by the principal components

result <- factanal(genres_result, factors = 9, rotation = "varimax", scores="regression")
result
## 
## Call:
## factanal(x = genres_result, factors = 9, scores = "regression",     rotation = "varimax")
## 
## Uniquenesses:
##         action      adventure        fantasy sciencefiction       thriller 
##          0.390          0.524          0.754          0.774          0.252 
##        western      animation         comedy         family        musical 
##          0.981          0.522          0.005          0.356          0.927 
##        romance        mystery          drama        history          sport 
##          0.810          0.804          0.474          0.298          0.005 
##          crime         horror      biography            war          music 
##          0.005          0.406          0.790          0.828          0.951 
##    documentary           news          short       filmnoir 
##          0.005          0.987          0.994          0.997 
## 
## Loadings:
##                Factor1 Factor2 Factor3 Factor4 Factor5 Factor6 Factor7
## action         -0.106   0.746                                   0.104 
## adventure       0.428   0.491                          -0.143         
## fantasy         0.441   0.142                                         
## sciencefiction          0.426                                  -0.148 
## thriller       -0.238   0.367                           0.717   0.107 
## western                                                               
## animation       0.678                                                 
## comedy          0.156  -0.164  -0.207                  -0.313         
## family          0.787                                                 
## musical         0.247  -0.103                                         
## romance                -0.297          -0.105          -0.165  -0.123 
## mystery                                                 0.395         
## drama          -0.184  -0.472   0.225  -0.159           0.124         
## history                         0.833                                 
## sport                                           0.993                 
## crime          -0.152                                   0.294   0.929 
## horror                                                  0.162  -0.108 
## biography              -0.156   0.370           0.155                 
## war                             0.392                                 
## music                  -0.157                                         
## documentary                             0.986                         
## news                                    0.107                         
## short                                                                 
## filmnoir                                                              
##                Factor8 Factor9
## action                  0.136 
## adventure               0.105 
## fantasy                       
## sciencefiction                
## thriller                      
## western                       
## animation                     
## comedy          0.883   0.133 
## family          0.107         
## musical                       
## romance                 0.187 
## mystery                -0.164 
## drama          -0.299   0.295 
## history                       
## sport                         
## crime                   0.110 
## horror                 -0.728 
## biography      -0.132         
## war                           
## music                         
## documentary                   
## news                          
## short                         
## filmnoir                      
## 
##                Factor1 Factor2 Factor3 Factor4 Factor5 Factor6 Factor7
## SS loadings      1.699   1.543   1.128   1.061   1.047   0.982   0.966
## Proportion Var   0.071   0.064   0.047   0.044   0.044   0.041   0.040
## Cumulative Var   0.071   0.135   0.182   0.226   0.270   0.311   0.351
##                Factor8 Factor9
## SS loadings      0.958   0.778
## Proportion Var   0.040   0.032
## Cumulative Var   0.391   0.423
## 
## Test of the hypothesis that 9 factors are sufficient.
## The chi square statistic is 380.21 on 96 degrees of freedom.
## The p-value is 1.82e-35

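No matter how many factors are specified, the sufficiency test never yields an acceptable fit (p-value >= 0.05).

Genres therefore cannot be reduced to a smaller set of factors.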
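The claim that no factor count gives an acceptable fit can be checked by looping over factor counts and collecting the p-value that factanal() reports. The sketch below is a minimal illustration on synthetic data, not the movie data; the helper name factor_pvalues and the demo data frame are assumptions introduced here.

```r
# Hypothetical helper: fit factanal() for k = 1..max_k factors and collect
# the p-value of the "k factors are sufficient" test for each fit.
factor_pvalues <- function(x, max_k) {
  sapply(seq_len(max_k), function(k) {
    fit <- tryCatch(
      factanal(x, factors = k, rotation = "varimax"),
      error = function(e) NULL          # too many factors or failed optimisation
    )
    # PVAL is absent when the fit fails or has zero degrees of freedom
    if (is.null(fit) || is.null(fit$PVAL)) NA_real_ else unname(fit$PVAL)
  })
}

# Synthetic demo data (assumption): six variables driven by two latent factors
set.seed(1)
n  <- 500
f1 <- rnorm(n)
f2 <- rnorm(n)
demo <- data.frame(
  a = f1 + rnorm(n, sd = 0.5), b = f1 + rnorm(n, sd = 0.5), c = f1 + rnorm(n, sd = 0.5),
  d = f2 + rnorm(n, sd = 0.5), e = f2 + rnorm(n, sd = 0.5), f = f2 + rnorm(n, sd = 0.5)
)

pvals <- factor_pvalues(demo, max_k = 3)
print(round(pvals, 4))   # one entry per factor count; NA where no test is available
```

Applied to the numeric genre matrix, a call such as factor_pvalues(genres_result, 9) would list the p-values for one to nine factors, so the statement above can be read directly from the result.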

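(3) Content rating

The content ratings of the roughly 5,000 top films screened in the United States follow different standards depending on the country of origin.

To find out whether a film's content rating affects gross, the various country-specific ratings must first be converted to representative US ratings.

For the homogeneity tests, the movie_USA data must be divided into three ratings (PG, PG-13, R). Nested ifelse() calls are used for the conversion.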

content_rating <- movie_USA$content_rating
content_rating <- ifelse(content_rating == "R" | content_rating == "M" | content_rating == 
    "NC-17" | content_rating == "18" | content_rating == "TV-MA", "R", ifelse(content_rating == 
    "PG-13" | content_rating == "TV-14" | content_rating == "15" | content_rating == 
    "TV-PG", "PG-13", ifelse(content_rating == "12" | content_rating == "All" | 
    content_rating == "PG" | content_rating == "G" | content_rating == "GP" | 
    content_rating == "TV-G" | content_rating == "TV-Y" | content_rating == 
    "TV-Y7", "PG", "Not Rated")))

movie_USA$content_rating <- content_rating
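The same mapping can be written as a lookup table, which is easier to audit than nested ifelse() calls. The sketch below is one possible restatement; the names rating_map and map_rating are assumptions introduced here, and the label lists are copied from the ifelse() conditions above.

```r
# Hypothetical lookup: each US target rating lists the source labels mapped to it
rating_map <- list(
  "R"     = c("R", "M", "NC-17", "18", "TV-MA"),
  "PG-13" = c("PG-13", "TV-14", "15", "TV-PG"),
  "PG"    = c("12", "All", "PG", "G", "GP", "TV-G", "TV-Y", "TV-Y7")
)

map_rating <- function(x, map = rating_map) {
  out <- rep("Not Rated", length(x))     # default for any label not listed
  for (target in names(map)) {
    out[x %in% map[[target]]] <- target
  }
  out
}

mapped <- map_rating(c("TV-MA", "G", "15", "Unrated", NA, ""))
print(mapped)
```

One behavioral difference is worth noting: %in% treats NA as "not listed", so an NA input becomes "Not Rated" here, whereas the nested ifelse() version would return NA for it.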

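To handle the films left as Not Rated, we build a classification model from the genre document-term matrix (DTM) created in (2) above. The model is trained on the films that already have a rating. First, split the data by the representative ratings PG, PG-13, and R.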

PG <- subset(movie_USA, movie_USA$content_rating=='PG')
PG_13 <- subset(movie_USA, movie_USA$content_rating=='PG-13')
R <- subset(movie_USA, movie_USA$content_rating=='R')

To count genres per rating, add the rating to the genre DTM, group by rating, and sum the genre frequencies within each group.

# Prepend the rating as the first column
genres_result <- data.frame(content_rating, genres_result,stringsAsFactors = F)

# Group by rating, then sum each genre column (across() replaces the deprecated summarise_each()/funs())
content_group <- genres_result %>% group_by(content_rating) %>% summarise(across(everything(), sum))

# Drop the Not Rated row by label rather than by position
content_group <- content_group[content_group$content_rating != 'Not Rated',]
names <- content_group$content_rating

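The following graph visualizes the genre frequencies by rating.

To classify the Not Rated films, we build a model from the genres. First, the rated data is split into train and test sets at a 7:3 ratio.

IDEA :: Each of the PG, PG-13, and R ratings is likely to have genres that frequently fall under it.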

genre_data <- subset(genres_result,genres_result$content_rating!='Not Rated')

# Rebuild content_rating as a factor so that nnet treats it as a class label with only the three remaining levels
genre_data[,1] <- as.character(genre_data[,1])
genre_data[,1] <- as.factor(genre_data[,1])

# Split into train and test at a 7:3 ratio
idx <- sample(nrow(genre_data), nrow(genre_data) * 0.7)
train <- genre_data[idx,]
test <- genre_data[-idx,]

After fitting an artificial neural network, check the classification performance with a confusion matrix.

library(nnet)
model <- nnet(content_rating ~., data=train, decay=5e-4, size=10, maxit=300)
p <- predict(model, test, 'class')
library(caret)
confusion <- confusionMatrix(factor(p, levels = levels(test[,1])), test[,1])  # recent caret versions require factors with matching levels
confusion[[2]]
##           Reference
## Prediction  PG PG-13   R
##      PG    141    18   8
##      PG-13  41   214 145
##      R      40   165 437
confusion[[4]][,11]
##    Class: PG Class: PG-13     Class: R 
##    0.8043963    0.6549894    0.7047493
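The three figures printed by confusion[[4]][,11] are per-class balanced accuracies, that is, the mean of sensitivity and specificity for each class. The sketch below recomputes them in base R from the confusion matrix shown above, and also derives the overall accuracy for comparison. The variable names are chosen here for illustration.

```r
# Confusion matrix copied from the output above (rows = prediction, columns = reference)
cm <- matrix(
  c(141,  18,   8,
     41, 214, 145,
     40, 165, 437),
  nrow = 3, byrow = TRUE,
  dimnames = list(Prediction = c("PG", "PG-13", "R"),
                  Reference  = c("PG", "PG-13", "R"))
)

tp <- diag(cm)                                      # correctly classified films per class
sensitivity <- tp / colSums(cm)                     # TP / all actual members of the class
fp <- rowSums(cm) - tp                              # predicted as the class, actually another
specificity <- 1 - fp / (sum(cm) - colSums(cm))     # TN / all actual non-members
balanced_accuracy <- (sensitivity + specificity) / 2
overall_accuracy  <- sum(tp) / sum(cm)

print(round(balanced_accuracy, 7))
print(round(overall_accuracy, 4))
```

The balanced accuracies reproduce 0.8043963, 0.6549894, and 0.7047493, while the overall accuracy is 792 / 1209, about 0.655. Balanced accuracy for PG looks high mainly because its specificity is high; its sensitivity is only 141 / 222.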

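PG-13 and R show somewhat lower classification rates than PG.

However, the boundary between PG-13 and R is itself ambiguous in actual US rating practice, so confusion between these two classes is hard to regard as critical. Although the rates are somewhat low, the model can still be considered useful for classifying the Not Rated films.

Now classify the unrated (Not Rated) films with the model.

First, extract the rows labeled Not Rated from the genre DTM (genres_result).

Then apply the model and write the predicted classes back into content_rating.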

content_not <- subset(genres_result,genres_result$content_rating=='Not Rated')
p_not <- predict(model, content_not, 'class')
idx_not <- as.numeric(rownames(content_not))
content_rating[idx_not] <- p_not
movie_USA$content_rating <- content_rating

Now that the Not Rated films have been assigned to PG, PG-13, or R, movie_USA must be split into the three ratings again.

PG <- subset(movie_USA, movie_USA$content_rating=='PG')
PG_13 <- subset(movie_USA, movie_USA$content_rating=='PG-13')
R <- subset(movie_USA, movie_USA$content_rating=='R')

The data is now divided into PG, PG-13, and R by content rating.

To find out whether content rating is a variable that affects gross, test for differences in gross between the ratings.

var.test(PG$gross, PG_13$gross)
## 
##  F test to compare two variances
## 
## data:  PG$gross and PG_13$gross
## F = 0.90962, num df = 758, denom df = 1448, p-value = 0.139
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.804251 1.031318
## sample estimates:
## ratio of variances 
##            0.90962
var.test(PG$gross, R$gross)
## 
##  F test to compare two variances
## 
## data:  PG$gross and R$gross
## F = 4.3345, num df = 758, denom df = 1978, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  3.856346 4.887460
## sample estimates:
## ratio of variances 
##           4.334519
var.test(PG_13$gross, R$gross)
## 
##  F test to compare two variances
## 
## data:  PG_13$gross and R$gross
## F = 4.7652, num df = 1448, denom df = 1978, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  4.331013 5.246691
## sample estimates:
## ratio of variances 
##           4.765198

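When the p-value of var.test is 0.05 or below, the two groups cannot be regarded as having equal variances. Here only PG and PG-13 have comparable variances (p = 0.139); R differs from both.

To test whether the three groups share the same distribution, run kruskal.test.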

x <- c(PG$gross, PG_13$gross, R$gross)
g <- factor(rep(1:3, c(length(PG$gross),length(PG_13$gross),length(R$gross))),
            labels=c('PG',"PG_13","R"))
kruskal.test(x,g)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  x and g
## Kruskal-Wallis chi-squared = 397.06, df = 2, p-value < 2.2e-16

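The kruskal.test result shows that the distributions of the three groups differ. We therefore apply wilcox.test to each pair to see whether gross differs by content rating.

cf) When the groups are homogeneous, t.test() would be used instead.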

wilcox.test(PG$gross, PG_13$gross, alternative="two.sided", conf.int=TRUE, conf.level=0.95)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  PG$gross and PG_13$gross
## W = 584540, p-value = 0.01491
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##   519048 7974605
## sample estimates:
## difference in location 
##                4086294
wilcox.test(PG$gross, R$gross, alternative="two.sided", conf.int=TRUE, conf.level=0.95)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  PG$gross and R$gross
## W = 1037300, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  19447561 26557124
## sample estimates:
## difference in location 
##               23009924
wilcox.test(R$gross, PG_13$gross, alternative="two.sided", conf.int=TRUE, conf.level=0.95)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  R$gross and PG_13$gross
## W = 945970, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -18943791 -14163534
## sample estimates:
## difference in location 
##              -16517791

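The tests show that gross differs by content rating: every pairwise comparison is significant at the 0.05 level.

Content rating is therefore a meaningful nominal variable for predicting gross.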
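Running three separate two-sample tests raises the chance of a false positive, so the pairwise comparisons can also be run in one call with a multiple-comparison adjustment. The sketch below uses pairwise.wilcox.test() from the stats package with Holm adjustment. The data are synthetic log-normal values standing in for gross, and the group sizes and parameters are assumptions made only for the demonstration.

```r
# Synthetic right-skewed "gross" values for three hypothetical rating groups
set.seed(42)
gross  <- c(rlnorm(200, meanlog = 17.0),
            rlnorm(300, meanlog = 16.8),
            rlnorm(400, meanlog = 16.0))
rating <- factor(rep(c("PG", "PG_13", "R"), times = c(200, 300, 400)))

# All pairwise Wilcoxon rank-sum tests, Holm-adjusted for multiple comparisons
pw <- pairwise.wilcox.test(gross, rating, p.adjust.method = "holm")
print(pw$p.value)   # lower-triangular matrix of adjusted p-values
```

With the real data the call would take the x and g vectors built for kruskal.test above, for example pairwise.wilcox.test(x, g, p.adjust.method = "holm").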

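4. Building the multiple regression model

(1) Extract the data needed for multiple regression

# Replace missing values with 0, then keep only the columns needed for the regression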
movie_USA[is.na(movie_USA)] <- 0

reg_movie <- movie_USA[, c("content_rating", "movie_facebook_likes", "director_facebook_likes", 
    "duration", "actor_1_facebook_likes", "actor_2_facebook_likes", "actor_3_facebook_likes", 
    "total_cast_facebook_likes", "imdb_score", "num_voted_users", "num_critic_for_reviews", 
    "num_user_for_reviews", "director_grade", "actor_1_grade", "actor_2_grade", 
    "actor_3_grade", "gross", "budget")]

Recreate the per-rating subsets, which now contain only the regression variables.

PG <- subset(reg_movie, reg_movie$content_rating == 'PG')
PG_13 <- subset(reg_movie, reg_movie$content_rating == 'PG-13')
R <- subset(reg_movie, reg_movie$content_rating == 'R')

# Check dimensions
dim(PG); dim(PG_13); dim(R)
## [1] 759  18
## [1] 1449   18
## [1] 1979   18

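The preprocessed US film dataset supplies the train set for the regression models, and the Korean data supplies the test set used to evaluate them.

The Korean dataset must go through the same preprocessing. Because the steps are identical, they were wrapped in a function, kyj().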

movie_KR <- read.csv('movie_metadata_KR.csv', header = T, stringsAsFactors = F)
movie_KR <- kyj(movie_KR)

### Dataset containing only the variables used in the regression models
reg_KR <- movie_KR$reg_movie

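The goal of this project is to predict the revenue of Korean films entering the US market.

The train set is built from the US film dataset, and the test set from the Korean films that have a recorded gross; the models are evaluated on the latter.

Once the best models are found, they are used to predict gross for the Korean films that have none recorded.

## Extract the Korean films that have a recorded gross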
KR_idx <- which(reg_KR[,'gross']!=0)

## Use this subset as the test data for the models
test <- reg_KR[KR_idx,]
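library(rJava) ## required by RWeka
library(RWeka) ## provides M5P()
## Score table, one row per seed and one column per rating
## (ACC is not defined in the listing; a 100 x 3 data frame is assumed, matching the printed output)
ACC <- data.frame(PG = rep(NA_real_, 100), PG_13 = NA_real_, R = NA_real_)
CNT <- 1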
for(k in 1:100){  
    
    cat(CNT, '  ')
    
    set.seed(CNT+100)
    ## Randomly sample 70% of the US rows for training
    idx <- sample(1:nrow(reg_movie), nrow(reg_movie)*0.7)
    
    train <- reg_movie[idx,]
    
    PG <- subset(train, train$content_rating == 'PG')
    PG_13 <- subset(train, train$content_rating == 'PG-13')
    R <- subset(train, train$content_rating == 'R')
    
    PG_t <- subset(test, test$content_rating == 'PG')
    PG_13_t <- subset(test, test$content_rating == 'PG-13')
    R_t <- subset(test, test$content_rating == 'R')
    
    
    PG_model <- lm(gross~., data=PG[,-1])
    PG_pred <- predict(PG_model, PG_t[,-1])
    ACC[CNT,1] <- cor(PG_pred, PG_t$gross)
    
    
    PG_13_model <- lm(gross~., data=PG_13[,-1])
    PG_13_pred <- predict(PG_13_model, PG_13_t[,-1])
    ACC[CNT,2] <- cor(PG_13_pred, PG_13_t$gross)
    
    
    R_model <- M5P(gross~., data=R[,-1])
    R_pred <- predict(R_model, R_t[,-1])
    ACC[CNT,3] <- cor(R_pred, R_t$gross) 
    
    
    CNT <- CNT + 1      
    
} # outer for K   
## 1   2   3   4   5   6   7   8   9   10   11   12   13   14   15   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63   64   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79   80   81   82   83   84   85   86   87   88   89   90   91   92   93   94   95   96   97   98   99   100
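The loop above scores each model by the correlation between predicted and actual gross. Correlation measures only linear association, so predictions that are systematically too large or too small can still score highly. A hedged sketch of a complementary check follows; the helper name eval_pred and the toy numbers are assumptions introduced here, not part of the original analysis.

```r
# Hypothetical helper: report correlation together with scale-aware error metrics
eval_pred <- function(pred, actual) {
  c(cor  = cor(pred, actual),
    rmse = sqrt(mean((pred - actual)^2)),
    mae  = mean(abs(pred - actual)))
}

# Toy example: predictions perfectly correlated with the truth but ten times too large
actual <- c(1e6, 2e6, 3e6, 4e6)
pred   <- actual * 10
res <- eval_pred(pred, actual)
print(res)
```

In this toy case the correlation is 1 even though every prediction is off by a factor of ten, which is why RMSE or MAE is worth recording next to each ACC entry when the goal is a revenue figure rather than a ranking.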

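From the fitted models, take the seed of the run with the highest mean correlation between predicted and actual gross across the three ratings, and rebuild the final models with it.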

max(apply(ACC, 1, mean))
## [1] 0.5521602
seed_max <- which.max(apply(ACC, 1, mean))
ACC[seed_max,]
##           PG     PG_13        R
## 47 0.1093992 0.7111903 0.835891
set.seed(seed_max+100)
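idx <- sample(1:nrow(reg_movie), nrow(reg_movie)*0.7)
train <- reg_movie[idx,]  # rebuild train for the selected seed; otherwise the last loop iteration's train is reused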

PG <- subset(train, train$content_rating == 'PG')
PG_13 <- subset(train, train$content_rating == 'PG-13')
R <- subset(train, train$content_rating == 'R')

PG_model <- lm(gross~., data=PG[,-1])
PG_13_model <- lm(gross~., data=PG_13[,-1])
R_model <- M5P(gross~., data=R[,-1])

Use the resulting models to predict the gross of Korean films entering the US market.

gross_pred <- function(reg_KR){
    
    for(n in 1:nrow(reg_KR)){
        if(reg_KR[n,'content_rating']=='PG' & reg_KR[n,'gross']==0){
            reg_KR[n, 'gross'] <- predict(PG_model, reg_KR[n,])
        }else if(reg_KR[n,'content_rating']=='PG-13' & reg_KR[n,'gross']==0){
            reg_KR[n, 'gross'] <- predict(PG_13_model, reg_KR[n,])
        }else if(reg_KR[n,'content_rating']=='R' & reg_KR[n,'gross']==0){
            reg_KR[n, 'gross'] <- predict(R_model, reg_KR[n,])
        }
        # cat('\n', n)
    }
    return(reg_KR)
}

reg_KR[-KR_idx,] <- gross_pred(reg_KR[-KR_idx,])
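The row-by-row loop in gross_pred() can also be written with one predict() call per rating. The sketch below shows that pattern on synthetic data. The data frames us and kr, the single budget predictor, the use of lm() for all three ratings, and the helper name fill_gross are assumptions made for the demonstration; the original uses more predictors and an M5P model for R.

```r
# Synthetic stand-ins for the per-rating models and the Korean data (assumptions)
set.seed(7)
us <- data.frame(content_rating = rep(c("PG", "PG-13", "R"), each = 50),
                 budget = runif(150, 1e6, 1e8))
us$gross <- 2 * us$budget + rnorm(150, sd = 1e6)

# One linear model per rating, stored in a list named by rating
models <- lapply(split(us, us$content_rating),
                 function(d) lm(gross ~ budget, data = d))

kr <- data.frame(content_rating = c("PG", "R", "PG-13", "R"),
                 budget = c(5e6, 2e7, 1e7, 3e7),
                 gross  = c(0, 4e7, 0, 0))      # 0 marks a missing gross, as in the source

# Fill gross only where it is missing, one predict() call per rating
fill_gross <- function(df, models) {
  for (r in names(models)) {
    rows <- which(df$content_rating == r & df$gross == 0)
    if (length(rows) > 0) {
      df$gross[rows] <- predict(models[[r]], newdata = df[rows, , drop = FALSE])
    }
  }
  df
}

kr_filled <- fill_gross(kr, models)
print(kr_filled)
```

Rows that already have a gross are left untouched, and ratings without a model are skipped, which mirrors the behavior of the if/else chain above.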

Insight

Correlation

  1. Variables highly correlated with gross
  • num_voted_users
  • num_user_for_reviews
  • num_critic_for_reviews
  • director_grade
  • movie_facebook_likes
  2. Variables with little or no correlation with gross
  • facenumber_in_poster
  • title_year
  • aspect_ratio
  • title_length

Principal component analysis

  1. Direction (variables that load in the same direction)
  • duration / director_grade / director_facebook_likes / imdb_score
  • num_voted_users / num_user_for_reviews / movie_facebook_likes / num_critic_for_reviews / gross
  • actor_1_grade / actor_1_facebook_likes
  • actor_2_grade / actor_2_facebook_likes / total_cast_facebook_likes
  • actor_3_grade / actor_3_facebook_likes

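Handling nominal variables

  1. Language and country are not suitable variables for comparing gross.

  2. Individual genres do not group into common factors.

  3. Genre alone is not enough to classify content rating reliably.

  4. Garbage/trash data needs to be handled.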