1. Introduction

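As digital technology has advanced, news organizations once built on media such as newspapers and broadcasting have moved into online publishing. Most news content is now distributed through online and mobile platforms, and the number of published articles has grown sharply. The leading portal sites ‘Naver’ and ‘Daum’ deliver an average of 50,000 articles a day to readers, which comes to roughly 18 million articles a year. Among so many articles, which ones are ‘popular’? For a news organization, this question matters a great deal when setting editorial direction or securing the advertising revenue needed for stable management. This report attempts to build a model that classifies whether an online news article is popular.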

2. Data Description

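The data cover every article published on the news site Mashable (www.mashable.com) over two years (2013-01-07 to 2015-01-07). The dataset consists of 39,644 observations (the row count reported by summary() in section 3-3) and 61 variables. It can be downloaded from the following site.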

https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity

The more shares an article receives, the more popular it is taken to be. The variables are described in detail below.

No. Variable Description
0 url URL of the article (non-predictive)
1 timedelta Days between the article publication and the dataset acquisition (non-predictive)
2 n_tokens_title Number of words in the title
3 n_tokens_content Number of words in the content
4 n_unique_tokens Rate of unique words in the content
5 n_non_stop_words Rate of non-stop words in the content
6 n_non_stop_unique_tokens Rate of unique non-stop words in the content
7 num_hrefs Number of links
8 num_self_hrefs Number of links to other articles published by Mashable
9 num_imgs Number of images
10 num_videos Number of videos
11 average_token_length Average length of the words in the content
12 num_keywords Number of keywords in the metadata
13 data_channel_is_lifestyle Is data channel ‘Lifestyle’?
14 data_channel_is_entertainment Is data channel ‘Entertainment’?
15 data_channel_is_bus Is data channel ‘Business’?
16 data_channel_is_socmed Is data channel ‘Social Media’?
17 data_channel_is_tech Is data channel ‘Tech’?
18 data_channel_is_world Is data channel ‘World’?
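19 kw_min_min Worst keyword (min. shares)
20 kw_max_min Worst keyword (max. shares)
21 kw_avg_min Worst keyword (avg. shares)
22 kw_min_max Best keyword (min. shares)
23 kw_max_max Best keyword (max. shares)
24 kw_avg_max Best keyword (avg. shares)
25 kw_min_avg Avg. keyword (min. shares)
26 kw_max_avg Avg. keyword (max. shares)
27 kw_avg_avg Avg. keyword (avg. shares)
28 self_reference_min_shares Min. shares of referenced articles in Mashable
29 self_reference_max_shares Max. shares of referenced articles in Mashable
30 self_reference_avg_sharess Avg. shares of referenced articles in Mashable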
31 weekday_is_monday Was the article published on a Monday?
32 weekday_is_tuesday Was the article published on a Tuesday?
33 weekday_is_wednesday Was the article published on a Wednesday?
34 weekday_is_thursday Was the article published on a Thursday?
35 weekday_is_friday Was the article published on a Friday?
36 weekday_is_saturday Was the article published on a Saturday?
37 weekday_is_sunday Was the article published on a Sunday?
38 is_weekend Was the article published on the weekend?
39 LDA_00 Closeness to LDA topic 0
40 LDA_01 Closeness to LDA topic 1
41 LDA_02 Closeness to LDA topic 2
42 LDA_03 Closeness to LDA topic 3
43 LDA_04 Closeness to LDA topic 4
44 global_subjectivity Text subjectivity
45 global_sentiment_polarity Text sentiment polarity
46 global_rate_positive_words Rate of positive words in the content
47 global_rate_negative_words Rate of negative words in the content
48 rate_positive_words Rate of positive words among non-neutral tokens
49 rate_negative_words Rate of negative words among non-neutral tokens
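50 avg_positive_polarity Avg. polarity of positive words
51 min_positive_polarity Min. polarity of positive words
52 max_positive_polarity Max. polarity of positive words
53 avg_negative_polarity Avg. polarity of negative words
54 min_negative_polarity Min. polarity of negative words
55 max_negative_polarity Max. polarity of negative words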
56 title_subjectivity Title subjectivity
57 title_sentiment_polarity Title polarity
58 abs_title_subjectivity Absolute subjectivity level
59 abs_title_sentiment_polarity Absolute polarity level
60 shares Number of shares (target)

3. GroundWork

3-1. Setting the working directory and loading the data

# Check the working directory
getwd()
## [1] "C:/Users/e/Documents/R"
# Load the data (the CSV file must be in the working directory)
news <- read.csv("OnlineNewsPopularity.csv",
                 header = TRUE)

3-2. Installing packages

# Install packages
install.packages("dplyr", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/e/Documents/R/win-library/3.5'
## (as 'lib' is unspecified)
## package 'dplyr' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\e\AppData\Local\Temp\RtmpKkxJBc\downloaded_packages
library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
install.packages("randomForest", repos = "http://cran.us.r-project.org", dependencies = TRUE)
## Installing package into 'C:/Users/e/Documents/R/win-library/3.5'
## (as 'lib' is unspecified)
## package 'randomForest' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\e\AppData\Local\Temp\RtmpKkxJBc\downloaded_packages
library("randomForest")
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
install.packages("caret", repos = "http://cran.us.r-project.org", dependencies = TRUE)
## Installing package into 'C:/Users/e/Documents/R/win-library/3.5'
## (as 'lib' is unspecified)
## package 'caret' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\e\AppData\Local\Temp\RtmpKkxJBc\downloaded_packages
library("caret")
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
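
Installing every package on each run is slow and reprints the messages shown above. As a minimal sketch, the hypothetical helper find_missing (not part of the original analysis) reports which packages are absent, so that only those need to be installed:

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed.
find_missing <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

needed <- c("dplyr", "randomForest", "caret")
to_install <- find_missing(needed)
print(to_install)

# Uncomment to install only what is missing:
# if (length(to_install) > 0) {
#   install.packages(to_install, repos = "http://cran.us.r-project.org",
#                    dependencies = TRUE)
# }
```

The library() calls are still needed afterwards to attach the packages.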

3-3. Inspecting the data

# Inspect the data
head(news)
##                                                              url timedelta
## 1   http://mashable.com/2013/01/07/amazon-instant-video-browser/       731
## 2    http://mashable.com/2013/01/07/ap-samsung-sponsored-tweets/       731
## 3 http://mashable.com/2013/01/07/apple-40-billion-app-downloads/       731
## 4       http://mashable.com/2013/01/07/astronaut-notre-dame-bcs/       731
## 5               http://mashable.com/2013/01/07/att-u-verse-apps/       731
## 6               http://mashable.com/2013/01/07/beewi-smart-toys/       731
##   n_tokens_title n_tokens_content n_unique_tokens n_non_stop_words
## 1             12              219       0.6635945                1
## 2              9              255       0.6047431                1
## 3              9              211       0.5751295                1
## 4              9              531       0.5037879                1
## 5             13             1072       0.4156456                1
## 6             10              370       0.5598886                1
##   n_non_stop_unique_tokens num_hrefs num_self_hrefs num_imgs num_videos
## 1                0.8153846         4              2        1          0
## 2                0.7919463         3              1        1          0
## 3                0.6638655         3              1        1          0
## 4                0.6656347         9              0        1          0
## 5                0.5408895        19             19       20          0
## 6                0.6981982         2              2        0          0
##   average_token_length num_keywords data_channel_is_lifestyle
## 1             4.680365            5                         0
## 2             4.913725            4                         0
## 3             4.393365            6                         0
## 4             4.404896            7                         0
## 5             4.682836            7                         0
## 6             4.359459            9                         0
##   data_channel_is_entertainment data_channel_is_bus data_channel_is_socmed
## 1                             1                   0                      0
## 2                             0                   1                      0
## 3                             0                   1                      0
## 4                             1                   0                      0
## 5                             0                   0                      0
## 6                             0                   0                      0
##   data_channel_is_tech data_channel_is_world kw_min_min kw_max_min
## 1                    0                     0          0          0
## 2                    0                     0          0          0
## 3                    0                     0          0          0
## 4                    0                     0          0          0
## 5                    1                     0          0          0
## 6                    1                     0          0          0
##   kw_avg_min kw_min_max kw_max_max kw_avg_max kw_min_avg kw_max_avg
## 1          0          0          0          0          0          0
## 2          0          0          0          0          0          0
## 3          0          0          0          0          0          0
## 4          0          0          0          0          0          0
## 5          0          0          0          0          0          0
## 6          0          0          0          0          0          0
##   kw_avg_avg self_reference_min_shares self_reference_max_shares
## 1          0                       496                       496
## 2          0                         0                         0
## 3          0                       918                       918
## 4          0                         0                         0
## 5          0                       545                     16000
## 6          0                      8500                      8500
##   self_reference_avg_sharess weekday_is_monday weekday_is_tuesday
## 1                    496.000                 1                  0
## 2                      0.000                 1                  0
## 3                    918.000                 1                  0
## 4                      0.000                 1                  0
## 5                   3151.158                 1                  0
## 6                   8500.000                 1                  0
##   weekday_is_wednesday weekday_is_thursday weekday_is_friday
## 1                    0                   0                 0
## 2                    0                   0                 0
## 3                    0                   0                 0
## 4                    0                   0                 0
## 5                    0                   0                 0
## 6                    0                   0                 0
##   weekday_is_saturday weekday_is_sunday is_weekend     LDA_00     LDA_01
## 1                   0                 0          0 0.50033120 0.37827893
## 2                   0                 0          0 0.79975569 0.05004668
## 3                   0                 0          0 0.21779229 0.03333446
## 4                   0                 0          0 0.02857322 0.41929964
## 5                   0                 0          0 0.02863281 0.02879355
## 6                   0                 0          0 0.02224528 0.30671758
##       LDA_02     LDA_03     LDA_04 global_subjectivity
## 1 0.04000468 0.04126265 0.04012254           0.5216171
## 2 0.05009625 0.05010067 0.05000071           0.3412458
## 3 0.03335142 0.03333354 0.68218829           0.7022222
## 4 0.49465083 0.02890472 0.02857160           0.4298497
## 5 0.02857518 0.02857168 0.88542678           0.5135021
## 6 0.02223128 0.02222429 0.62658158           0.4374086
##   global_sentiment_polarity global_rate_positive_words
## 1                0.09256198                 0.04566210
## 2                0.14894781                 0.04313725
## 3                0.32333333                 0.05687204
## 4                0.10070467                 0.04143126
## 5                0.28100348                 0.07462687
## 6                0.07118419                 0.02972973
##   global_rate_negative_words rate_positive_words rate_negative_words
## 1                0.013698630           0.7692308           0.2307692
## 2                0.015686275           0.7333333           0.2666667
## 3                0.009478673           0.8571429           0.1428571
## 4                0.020715631           0.6666667           0.3333333
## 5                0.012126866           0.8602151           0.1397849
## 6                0.027027027           0.5238095           0.4761905
##   avg_positive_polarity min_positive_polarity max_positive_polarity
## 1             0.3786364            0.10000000                   0.7
## 2             0.2869146            0.03333333                   0.7
## 3             0.4958333            0.10000000                   1.0
## 4             0.3859652            0.13636364                   0.8
## 5             0.4111274            0.03333333                   1.0
## 6             0.3506100            0.13636364                   0.6
##   avg_negative_polarity min_negative_polarity max_negative_polarity
## 1            -0.3500000                -0.600            -0.2000000
## 2            -0.1187500                -0.125            -0.1000000
## 3            -0.4666667                -0.800            -0.1333333
## 4            -0.3696970                -0.600            -0.1666667
## 5            -0.2201923                -0.500            -0.0500000
## 6            -0.1950000                -0.400            -0.1000000
##   title_subjectivity title_sentiment_polarity abs_title_subjectivity
## 1          0.5000000               -0.1875000             0.00000000
## 2          0.0000000                0.0000000             0.50000000
## 3          0.0000000                0.0000000             0.50000000
## 4          0.0000000                0.0000000             0.50000000
## 5          0.4545455                0.1363636             0.04545455
## 6          0.6428571                0.2142857             0.14285714
##   abs_title_sentiment_polarity shares
## 1                    0.1875000    593
## 2                    0.0000000    711
## 3                    0.0000000   1500
## 4                    0.0000000   1200
## 5                    0.1363636    505
## 6                    0.2142857    855

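We explored the data and found outliers in the Max values. n_unique_tokens, n_non_stop_words, and n_non_stop_unique_tokens are rates, so they should lie between 0 and 1, yet their maxima are 701, 1042, and 650. These values have to be handled during preprocessing.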

summary(news)
##                                                              url       
##  http://mashable.com/2013/01/07/amazon-instant-video-browser/  :    1  
##  http://mashable.com/2013/01/07/ap-samsung-sponsored-tweets/   :    1  
##  http://mashable.com/2013/01/07/apple-40-billion-app-downloads/:    1  
##  http://mashable.com/2013/01/07/astronaut-notre-dame-bcs/      :    1  
##  http://mashable.com/2013/01/07/att-u-verse-apps/              :    1  
##  http://mashable.com/2013/01/07/beewi-smart-toys/              :    1  
##  (Other)                                                       :39638  
##    timedelta     n_tokens_title n_tokens_content n_unique_tokens   
##  Min.   :  8.0   Min.   : 2.0   Min.   :   0.0   Min.   :  0.0000  
##  1st Qu.:164.0   1st Qu.: 9.0   1st Qu.: 246.0   1st Qu.:  0.4709  
##  Median :339.0   Median :10.0   Median : 409.0   Median :  0.5392  
##  Mean   :354.5   Mean   :10.4   Mean   : 546.5   Mean   :  0.5482  
##  3rd Qu.:542.0   3rd Qu.:12.0   3rd Qu.: 716.0   3rd Qu.:  0.6087  
##  Max.   :731.0   Max.   :23.0   Max.   :8474.0   Max.   :701.0000  
##                                                                    
##  n_non_stop_words    n_non_stop_unique_tokens   num_hrefs     
##  Min.   :   0.0000   Min.   :  0.0000         Min.   :  0.00  
##  1st Qu.:   1.0000   1st Qu.:  0.6257         1st Qu.:  4.00  
##  Median :   1.0000   Median :  0.6905         Median :  8.00  
##  Mean   :   0.9965   Mean   :  0.6892         Mean   : 10.88  
##  3rd Qu.:   1.0000   3rd Qu.:  0.7546         3rd Qu.: 14.00  
##  Max.   :1042.0000   Max.   :650.0000         Max.   :304.00  
##                                                               
##  num_self_hrefs       num_imgs         num_videos    average_token_length
##  Min.   :  0.000   Min.   :  0.000   Min.   : 0.00   Min.   :0.000       
##  1st Qu.:  1.000   1st Qu.:  1.000   1st Qu.: 0.00   1st Qu.:4.478       
##  Median :  3.000   Median :  1.000   Median : 0.00   Median :4.664       
##  Mean   :  3.294   Mean   :  4.544   Mean   : 1.25   Mean   :4.548       
##  3rd Qu.:  4.000   3rd Qu.:  4.000   3rd Qu.: 1.00   3rd Qu.:4.855       
##  Max.   :116.000   Max.   :128.000   Max.   :91.00   Max.   :8.042       
##                                                                          
##   num_keywords    data_channel_is_lifestyle data_channel_is_entertainment
##  Min.   : 1.000   Min.   :0.00000           Min.   :0.000                
##  1st Qu.: 6.000   1st Qu.:0.00000           1st Qu.:0.000                
##  Median : 7.000   Median :0.00000           Median :0.000                
##  Mean   : 7.224   Mean   :0.05295           Mean   :0.178                
##  3rd Qu.: 9.000   3rd Qu.:0.00000           3rd Qu.:0.000                
##  Max.   :10.000   Max.   :1.00000           Max.   :1.000                
##                                                                          
##  data_channel_is_bus data_channel_is_socmed data_channel_is_tech
##  Min.   :0.0000      Min.   :0.0000         Min.   :0.0000      
##  1st Qu.:0.0000      1st Qu.:0.0000         1st Qu.:0.0000      
##  Median :0.0000      Median :0.0000         Median :0.0000      
##  Mean   :0.1579      Mean   :0.0586         Mean   :0.1853      
##  3rd Qu.:0.0000      3rd Qu.:0.0000         3rd Qu.:0.0000      
##  Max.   :1.0000      Max.   :1.0000         Max.   :1.0000      
##                                                                 
##  data_channel_is_world   kw_min_min       kw_max_min       kw_avg_min     
##  Min.   :0.0000        Min.   : -1.00   Min.   :     0   Min.   :   -1.0  
##  1st Qu.:0.0000        1st Qu.: -1.00   1st Qu.:   445   1st Qu.:  141.8  
##  Median :0.0000        Median : -1.00   Median :   660   Median :  235.5  
##  Mean   :0.2126        Mean   : 26.11   Mean   :  1154   Mean   :  312.4  
##  3rd Qu.:0.0000        3rd Qu.:  4.00   3rd Qu.:  1000   3rd Qu.:  357.0  
##  Max.   :1.0000        Max.   :377.00   Max.   :298400   Max.   :42827.9  
##                                                                           
##    kw_min_max       kw_max_max       kw_avg_max       kw_min_avg  
##  Min.   :     0   Min.   :     0   Min.   :     0   Min.   :  -1  
##  1st Qu.:     0   1st Qu.:843300   1st Qu.:172847   1st Qu.:   0  
##  Median :  1400   Median :843300   Median :244572   Median :1024  
##  Mean   : 13612   Mean   :752324   Mean   :259282   Mean   :1117  
##  3rd Qu.:  7900   3rd Qu.:843300   3rd Qu.:330980   3rd Qu.:2057  
##  Max.   :843300   Max.   :843300   Max.   :843300   Max.   :3613  
##                                                                   
##    kw_max_avg       kw_avg_avg    self_reference_min_shares
##  Min.   :     0   Min.   :    0   Min.   :     0           
##  1st Qu.:  3562   1st Qu.: 2382   1st Qu.:   639           
##  Median :  4356   Median : 2870   Median :  1200           
##  Mean   :  5657   Mean   : 3136   Mean   :  3999           
##  3rd Qu.:  6020   3rd Qu.: 3600   3rd Qu.:  2600           
##  Max.   :298400   Max.   :43568   Max.   :843300           
##                                                            
##  self_reference_max_shares self_reference_avg_sharess weekday_is_monday
##  Min.   :     0            Min.   :     0.0           Min.   :0.000    
##  1st Qu.:  1100            1st Qu.:   981.2           1st Qu.:0.000    
##  Median :  2800            Median :  2200.0           Median :0.000    
##  Mean   : 10329            Mean   :  6401.7           Mean   :0.168    
##  3rd Qu.:  8000            3rd Qu.:  5200.0           3rd Qu.:0.000    
##  Max.   :843300            Max.   :843300.0           Max.   :1.000    
##                                                                        
##  weekday_is_tuesday weekday_is_wednesday weekday_is_thursday
##  Min.   :0.0000     Min.   :0.0000       Min.   :0.0000     
##  1st Qu.:0.0000     1st Qu.:0.0000       1st Qu.:0.0000     
##  Median :0.0000     Median :0.0000       Median :0.0000     
##  Mean   :0.1864     Mean   :0.1875       Mean   :0.1833     
##  3rd Qu.:0.0000     3rd Qu.:0.0000       3rd Qu.:0.0000     
##  Max.   :1.0000     Max.   :1.0000       Max.   :1.0000     
##                                                             
##  weekday_is_friday weekday_is_saturday weekday_is_sunday   is_weekend    
##  Min.   :0.0000    Min.   :0.00000     Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000    1st Qu.:0.00000     1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000    Median :0.00000     Median :0.00000   Median :0.0000  
##  Mean   :0.1438    Mean   :0.06188     Mean   :0.06904   Mean   :0.1309  
##  3rd Qu.:0.0000    3rd Qu.:0.00000     3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1.0000    Max.   :1.00000     Max.   :1.00000   Max.   :1.0000  
##                                                                          
##      LDA_00            LDA_01            LDA_02            LDA_03       
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.02505   1st Qu.:0.02501   1st Qu.:0.02857   1st Qu.:0.02857  
##  Median :0.03339   Median :0.03334   Median :0.04000   Median :0.04000  
##  Mean   :0.18460   Mean   :0.14126   Mean   :0.21632   Mean   :0.22377  
##  3rd Qu.:0.24096   3rd Qu.:0.15083   3rd Qu.:0.33422   3rd Qu.:0.37576  
##  Max.   :0.92699   Max.   :0.92595   Max.   :0.92000   Max.   :0.92653  
##                                                                         
##      LDA_04        global_subjectivity global_sentiment_polarity
##  Min.   :0.00000   Min.   :0.0000      Min.   :-0.39375         
##  1st Qu.:0.02857   1st Qu.:0.3962      1st Qu.: 0.05776         
##  Median :0.04073   Median :0.4535      Median : 0.11912         
##  Mean   :0.23403   Mean   :0.4434      Mean   : 0.11931         
##  3rd Qu.:0.39999   3rd Qu.:0.5083      3rd Qu.: 0.17783         
##  Max.   :0.92719   Max.   :1.0000      Max.   : 0.72784         
##                                                                 
##  global_rate_positive_words global_rate_negative_words rate_positive_words
##  Min.   :0.00000            Min.   :0.000000           Min.   :0.0000     
##  1st Qu.:0.02838            1st Qu.:0.009615           1st Qu.:0.6000     
##  Median :0.03902            Median :0.015337           Median :0.7105     
##  Mean   :0.03962            Mean   :0.016612           Mean   :0.6822     
##  3rd Qu.:0.05028            3rd Qu.:0.021739           3rd Qu.:0.8000     
##  Max.   :0.15549            Max.   :0.184932           Max.   :1.0000     
##                                                                           
##  rate_negative_words avg_positive_polarity min_positive_polarity
##  Min.   :0.0000      Min.   :0.0000        Min.   :0.00000      
##  1st Qu.:0.1852      1st Qu.:0.3062        1st Qu.:0.05000      
##  Median :0.2800      Median :0.3588        Median :0.10000      
##  Mean   :0.2879      Mean   :0.3538        Mean   :0.09545      
##  3rd Qu.:0.3846      3rd Qu.:0.4114        3rd Qu.:0.10000      
##  Max.   :1.0000      Max.   :1.0000        Max.   :1.00000      
##                                                                 
##  max_positive_polarity avg_negative_polarity min_negative_polarity
##  Min.   :0.0000        Min.   :-1.0000       Min.   :-1.0000      
##  1st Qu.:0.6000        1st Qu.:-0.3284       1st Qu.:-0.7000      
##  Median :0.8000        Median :-0.2533       Median :-0.5000      
##  Mean   :0.7567        Mean   :-0.2595       Mean   :-0.5219      
##  3rd Qu.:1.0000        3rd Qu.:-0.1869       3rd Qu.:-0.3000      
##  Max.   :1.0000        Max.   : 0.0000       Max.   : 0.0000      
##                                                                   
##  max_negative_polarity title_subjectivity title_sentiment_polarity
##  Min.   :-1.0000       Min.   :0.0000     Min.   :-1.00000        
##  1st Qu.:-0.1250       1st Qu.:0.0000     1st Qu.: 0.00000        
##  Median :-0.1000       Median :0.1500     Median : 0.00000        
##  Mean   :-0.1075       Mean   :0.2824     Mean   : 0.07143        
##  3rd Qu.:-0.0500       3rd Qu.:0.5000     3rd Qu.: 0.15000        
##  Max.   : 0.0000       Max.   :1.0000     Max.   : 1.00000        
##                                                                   
##  abs_title_subjectivity abs_title_sentiment_polarity     shares      
##  Min.   :0.0000         Min.   :0.0000               Min.   :     1  
##  1st Qu.:0.1667         1st Qu.:0.0000               1st Qu.:   946  
##  Median :0.5000         Median :0.0000               Median :  1400  
##  Mean   :0.3418         Mean   :0.1561               Mean   :  3395  
##  3rd Qu.:0.5000         3rd Qu.:0.2500               3rd Qu.:  2800  
##  Max.   :0.5000         Max.   :1.0000               Max.   :843300  
## 
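
Scanning a long summary() printout for impossible values is error-prone. The range check can also be done programmatically, as in the sketch below. It runs on a small made-up data frame named toy (an assumption standing in for news), with the three rate columns selected by hand:

```r
# Toy data standing in for `news`; the values are made up for illustration.
toy <- data.frame(
  n_unique_tokens          = c(0.66, 0.60, 701),
  n_non_stop_words         = c(1, 1, 1042),
  n_non_stop_unique_tokens = c(0.82, 0.79, 650)
)

rate_cols <- c("n_unique_tokens", "n_non_stop_words", "n_non_stop_unique_tokens")

# For each rate column, count the rows that fall outside [0, 1].
out_of_range <- sapply(toy[rate_cols], function(x) sum(x < 0 | x > 1))
print(out_of_range)

# Row positions where at least one rate column is out of range.
bad_rows <- unname(which(rowSums(toy[rate_cols] < 0 | toy[rate_cols] > 1) > 0))
print(bad_rows)
```

Applied to the real data, the same check would use news in place of toy.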

4. Data Preprocessing

4-1. Data cleaning

To locate the outlier found earlier, we looked for cases where n_unique_tokens is greater than 1. Only one such case turned up, row 31038, so that row was removed from the data.

# Preprocessing: remove the outlier
subset(news, n_unique_tokens>1)
##                                                                    url
## 31038 http://mashable.com/2014/08/18/ukraine-civilian-convoy-attacked/
##       timedelta n_tokens_title n_tokens_content n_unique_tokens
## 31038       142              9             1570             701
##       n_non_stop_words n_non_stop_unique_tokens num_hrefs num_self_hrefs
## 31038             1042                      650        11             10
##       num_imgs num_videos average_token_length num_keywords
## 31038       51          0             4.696178            7
##       data_channel_is_lifestyle data_channel_is_entertainment
## 31038                         0                             1
##       data_channel_is_bus data_channel_is_socmed data_channel_is_tech
## 31038                   0                      0                    0
##       data_channel_is_world kw_min_min kw_max_min kw_avg_min kw_min_max
## 31038                     0         -1        778   143.7143      23100
##       kw_max_max kw_avg_max kw_min_avg kw_max_avg kw_avg_avg
## 31038     843300   330442.9   2420.579   3490.599   2912.105
##       self_reference_min_shares self_reference_max_shares
## 31038                       795                         0
##       self_reference_avg_sharess weekday_is_monday weekday_is_tuesday
## 31038                   6924.375                 0                  1
##       weekday_is_wednesday weekday_is_thursday weekday_is_friday
## 31038                    0                   0                 0
##       weekday_is_saturday weekday_is_sunday is_weekend LDA_00 LDA_01
## 31038                   0                 0          0      0      0
##       LDA_02 LDA_03 LDA_04 global_subjectivity global_sentiment_polarity
## 31038      0      0      0                   0                         0
##       global_rate_positive_words global_rate_negative_words
## 31038                          0                          0
##       rate_positive_words rate_negative_words avg_positive_polarity
## 31038                   0                   0                     0
##       min_positive_polarity max_positive_polarity avg_negative_polarity
## 31038                     0                     0                     0
##       min_negative_polarity max_negative_polarity title_subjectivity
## 31038                     0                     0                  0
##       title_sentiment_polarity abs_title_subjectivity
## 31038                        0                      0
##       abs_title_sentiment_polarity shares
## 31038                            0   5900
news <- news[-31038,]
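
Dropping a row by its position works here, but it silently removes the wrong row if the data are reordered or filtered beforehand. A sketch of a condition-based alternative, shown on a made-up data frame named toy (an assumption standing in for news):

```r
# Toy data standing in for `news`; the values are made up for illustration.
toy <- data.frame(
  url             = c("a", "b", "c", "d"),
  n_unique_tokens = c(0.66, 701, 0.58, 0.50)
)

# Keep only rows whose rate lies in [0, 1], instead of dropping by position.
cleaned <- toy[toy$n_unique_tokens >= 0 & toy$n_unique_tokens <= 1, ]

# Number of rows removed by the filter.
removed <- nrow(toy) - nrow(cleaned)
print(removed)
```

On the real data the equivalent filter would be applied to news, and it should remove the same single row identified above.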

4-2. Variable transformation

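This analysis aims not to predict the number of shares but to classify whether an article is popular. The shares variable was therefore converted into a new variable, popular: 1 (popular) if shares exceeds the median, and 0 (unpopular) if it is equal to or below the median.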

# Preprocessing: variable transformation (shares -> popular)
news$popular <- factor(as.numeric(news$shares > median(news$shares)), levels = c(0, 1))
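
Because values equal to the median are assigned to class 0, a median split does not always give two classes of equal size. The sketch below illustrates this on a made-up vector of share counts (an assumption, not taken from the dataset):

```r
# Toy share counts with ties at the median, made up for illustration.
shares <- c(500, 800, 1400, 1400, 1400, 2000, 3000)

cutoff <- median(shares)
popular <- factor(as.numeric(shares > cutoff), levels = c(0, 1))

# Class sizes: values tied with the median fall into class 0.
print(cutoff)
print(table(popular))
```

Running table(news$popular) after the transformation is a quick way to see how balanced the classes are on the real data.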

4-3. Converting variables to factors

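This dataset contains nominal variables. The six variables beginning with ‘data_channel_is_’ indicate the channel in which the article was published. Eight variables, the seven beginning with ‘weekday_is_’ plus ‘is_weekend’, indicate the day of publication. These were therefore converted so that R treats them as factors.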

# change 'numeric' into 'factor'
news$data_channel_is_lifestyle <- as.factor(news$data_channel_is_lifestyle)
news$data_channel_is_entertainment <- as.factor(news$data_channel_is_entertainment)
news$data_channel_is_bus <- as.factor(news$data_channel_is_bus)
news$data_channel_is_socmed <- as.factor(news$data_channel_is_socmed)
news$data_channel_is_tech <- as.factor(news$data_channel_is_tech)
news$data_channel_is_world <- as.factor(news$data_channel_is_world)
news$weekday_is_monday <- as.factor(news$weekday_is_monday)
news$weekday_is_tuesday <- as.factor(news$weekday_is_tuesday)
news$weekday_is_wednesday <- as.factor(news$weekday_is_wednesday)
news$weekday_is_thursday <- as.factor(news$weekday_is_thursday)
news$weekday_is_friday <- as.factor(news$weekday_is_friday)
news$weekday_is_saturday <- as.factor(news$weekday_is_saturday)
news$weekday_is_sunday <- as.factor(news$weekday_is_sunday)
news$is_weekend <- as.factor(news$is_weekend)
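
The fourteen assignments above repeat the same operation, which makes typos in column names easy to miss. As a sketch under the assumption that the indicator columns can be picked out by their name prefixes, the conversion can be written once, shown here on a made-up data frame named toy:

```r
# Toy data standing in for `news`; the values are made up for illustration.
toy <- data.frame(
  data_channel_is_tech = c(0, 1, 0),
  weekday_is_monday    = c(1, 0, 0),
  is_weekend           = c(0, 0, 1),
  num_imgs             = c(1, 20, 0)
)

# Select the indicator columns by name pattern.
indicator_cols <- grep("^(data_channel_is_|weekday_is_)|^is_weekend$",
                       names(toy), value = TRUE)

# Convert every selected column to a factor in one step.
toy[indicator_cols] <- lapply(toy[indicator_cols], as.factor)

str(toy)
```

Columns that do not match the pattern, such as num_imgs, stay numeric.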
# Check the data again
summary(news)
##                                                              url       
##  http://mashable.com/2013/01/07/amazon-instant-video-browser/  :    1  
##  http://mashable.com/2013/01/07/ap-samsung-sponsored-tweets/   :    1  
##  http://mashable.com/2013/01/07/apple-40-billion-app-downloads/:    1  
##  http://mashable.com/2013/01/07/astronaut-notre-dame-bcs/      :    1  
##  http://mashable.com/2013/01/07/att-u-verse-apps/              :    1  
##  http://mashable.com/2013/01/07/beewi-smart-toys/              :    1  
##  (Other)                                                       :39637  
##    timedelta     n_tokens_title n_tokens_content n_unique_tokens 
##  Min.   :  8.0   Min.   : 2.0   Min.   :   0.0   Min.   :0.0000  
##  1st Qu.:164.0   1st Qu.: 9.0   1st Qu.: 246.0   1st Qu.:0.4709  
##  Median :339.0   Median :10.0   Median : 409.0   Median :0.5392  
##  Mean   :354.5   Mean   :10.4   Mean   : 546.5   Mean   :0.5305  
##  3rd Qu.:542.0   3rd Qu.:12.0   3rd Qu.: 716.0   3rd Qu.:0.6087  
##  Max.   :731.0   Max.   :23.0   Max.   :8474.0   Max.   :1.0000  
##                                                                  
##  n_non_stop_words n_non_stop_unique_tokens   num_hrefs     
##  Min.   :0.0000   Min.   :0.0000           Min.   :  0.00  
##  1st Qu.:1.0000   1st Qu.:0.6257           1st Qu.:  4.00  
##  Median :1.0000   Median :0.6905           Median :  8.00  
##  Mean   :0.9702   Mean   :0.6728           Mean   : 10.88  
##  3rd Qu.:1.0000   3rd Qu.:0.7546           3rd Qu.: 14.00  
##  Max.   :1.0000   Max.   :1.0000           Max.   :304.00  
##                                                            
##  num_self_hrefs       num_imgs         num_videos    average_token_length
##  Min.   :  0.000   Min.   :  0.000   Min.   : 0.00   Min.   :0.000       
##  1st Qu.:  1.000   1st Qu.:  1.000   1st Qu.: 0.00   1st Qu.:4.478       
##  Median :  3.000   Median :  1.000   Median : 0.00   Median :4.664       
##  Mean   :  3.293   Mean   :  4.543   Mean   : 1.25   Mean   :4.548       
##  3rd Qu.:  4.000   3rd Qu.:  4.000   3rd Qu.: 1.00   3rd Qu.:4.855       
##  Max.   :116.000   Max.   :128.000   Max.   :91.00   Max.   :8.042       
##                                                                          
##   num_keywords    data_channel_is_lifestyle data_channel_is_entertainment
##  Min.   : 1.000   0:37544                   0:32587                      
##  1st Qu.: 6.000   1: 2099                   1: 7056                      
##  Median : 7.000                                                          
##  Mean   : 7.224                                                          
##  3rd Qu.: 9.000                                                          
##  Max.   :10.000                                                          
##                                                                          
##  data_channel_is_bus data_channel_is_socmed data_channel_is_tech
##  0:33385             0:37320                0:32297             
##  1: 6258             1: 2323                1: 7346             
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##  data_channel_is_world   kw_min_min       kw_max_min       kw_avg_min     
##  0:31216               Min.   : -1.00   Min.   :     0   Min.   :   -1.0  
##  1: 8427               1st Qu.: -1.00   1st Qu.:   445   1st Qu.:  141.8  
##                        Median : -1.00   Median :   660   Median :  235.5  
##                        Mean   : 26.11   Mean   :  1154   Mean   :  312.4  
##                        3rd Qu.:  4.00   3rd Qu.:  1000   3rd Qu.:  357.0  
##                        Max.   :377.00   Max.   :298400   Max.   :42827.9  
##                                                                           
##    kw_min_max       kw_max_max       kw_avg_max       kw_min_avg  
##  Min.   :     0   Min.   :     0   Min.   :     0   Min.   :  -1  
##  1st Qu.:     0   1st Qu.:843300   1st Qu.:172844   1st Qu.:   0  
##  Median :  1400   Median :843300   Median :244567   Median :1024  
##  Mean   : 13612   Mean   :752322   Mean   :259280   Mean   :1117  
##  3rd Qu.:  7900   3rd Qu.:843300   3rd Qu.:330980   3rd Qu.:2057  
##  Max.   :843300   Max.   :843300   Max.   :843300   Max.   :3613  
##                                                                   
##    kw_max_avg       kw_avg_avg    self_reference_min_shares
##  Min.   :     0   Min.   :    0   Min.   :     0           
##  1st Qu.:  3562   1st Qu.: 2382   1st Qu.:   639           
##  Median :  4356   Median : 2870   Median :  1200           
##  Mean   :  5657   Mean   : 3136   Mean   :  3999           
##  3rd Qu.:  6020   3rd Qu.: 3600   3rd Qu.:  2600           
##  Max.   :298400   Max.   :43568   Max.   :843300           
##                                                            
##  self_reference_max_shares self_reference_avg_sharess weekday_is_monday
##  Min.   :     0            Min.   :     0.0           0:32982          
##  1st Qu.:  1100            1st Qu.:   981.1           1: 6661          
##  Median :  2800            Median :  2200.0                            
##  Mean   : 10330            Mean   :  6401.7                            
##  3rd Qu.:  8000            3rd Qu.:  5200.0                            
##  Max.   :843300            Max.   :843300.0                            
##                                                                        
##  weekday_is_tuesday weekday_is_wednesday weekday_is_thursday
##  0:32254            0:32208              0:32376            
##  1: 7389            1: 7435              1: 7267            
##                                                             
##                                                             
##                                                             
##                                                             
##                                                             
##  weekday_is_friday weekday_is_saturday weekday_is_sunday is_weekend
##  0:33942           0:37190             0:36906           0:34453   
##  1: 5701           1: 2453             1: 2737           1: 5190   
##                                                                    
##                                                                    
##                                                                    
##                                                                    
##                                                                    
##      LDA_00            LDA_01            LDA_02            LDA_03       
##  Min.   :0.01818   Min.   :0.01818   Min.   :0.01818   Min.   :0.01818  
##  1st Qu.:0.02505   1st Qu.:0.02501   1st Qu.:0.02857   1st Qu.:0.02857  
##  Median :0.03339   Median :0.03334   Median :0.04000   Median :0.04000  
##  Mean   :0.18460   Mean   :0.14126   Mean   :0.21633   Mean   :0.22378  
##  3rd Qu.:0.24097   3rd Qu.:0.15084   3rd Qu.:0.33422   3rd Qu.:0.37578  
##  Max.   :0.92699   Max.   :0.92595   Max.   :0.92000   Max.   :0.92653  
##                                                                         
##      LDA_04        global_subjectivity global_sentiment_polarity
##  Min.   :0.01818   Min.   :0.0000      Min.   :-0.39375         
##  1st Qu.:0.02857   1st Qu.:0.3962      1st Qu.: 0.05776         
##  Median :0.04073   Median :0.4535      Median : 0.11912         
##  Mean   :0.23404   Mean   :0.4434      Mean   : 0.11931         
##  3rd Qu.:0.39999   3rd Qu.:0.5083      3rd Qu.: 0.17784         
##  Max.   :0.92719   Max.   :1.0000      Max.   : 0.72784         
##                                                                 
##  global_rate_positive_words global_rate_negative_words rate_positive_words
##  Min.   :0.00000            Min.   :0.000000           Min.   :0.0000     
##  1st Qu.:0.02839            1st Qu.:0.009615           1st Qu.:0.6000     
##  Median :0.03902            Median :0.015337           Median :0.7105     
##  Mean   :0.03963            Mean   :0.016613           Mean   :0.6822     
##  3rd Qu.:0.05028            3rd Qu.:0.021739           3rd Qu.:0.8000     
##  Max.   :0.15549            Max.   :0.184932           Max.   :1.0000     
##                                                                           
##  rate_negative_words avg_positive_polarity min_positive_polarity
##  Min.   :0.0000      Min.   :0.0000        Min.   :0.00000      
##  1st Qu.:0.1852      1st Qu.:0.3062        1st Qu.:0.05000      
##  Median :0.2800      Median :0.3588        Median :0.10000      
##  Mean   :0.2879      Mean   :0.3538        Mean   :0.09545      
##  3rd Qu.:0.3846      3rd Qu.:0.4114        3rd Qu.:0.10000      
##  Max.   :1.0000      Max.   :1.0000        Max.   :1.00000      
##                                                                 
##  max_positive_polarity avg_negative_polarity min_negative_polarity
##  Min.   :0.0000        Min.   :-1.0000       Min.   :-1.000       
##  1st Qu.:0.6000        1st Qu.:-0.3284       1st Qu.:-0.700       
##  Median :0.8000        Median :-0.2533       Median :-0.500       
##  Mean   :0.7567        Mean   :-0.2595       Mean   :-0.522       
##  3rd Qu.:1.0000        3rd Qu.:-0.1869       3rd Qu.:-0.300       
##  Max.   :1.0000        Max.   : 0.0000       Max.   : 0.000       
##                                                                   
##  max_negative_polarity title_subjectivity title_sentiment_polarity
##  Min.   :-1.0000       Min.   :0.0000     Min.   :-1.00000        
##  1st Qu.:-0.1250       1st Qu.:0.0000     1st Qu.: 0.00000        
##  Median :-0.1000       Median :0.1500     Median : 0.00000        
##  Mean   :-0.1075       Mean   :0.2824     Mean   : 0.07143        
##  3rd Qu.:-0.0500       3rd Qu.:0.5000     3rd Qu.: 0.15000        
##  Max.   : 0.0000       Max.   :1.0000     Max.   : 1.00000        
##                                                                   
##  abs_title_subjectivity abs_title_sentiment_polarity     shares      
##  Min.   :0.0000         Min.   :0.0000               Min.   :     1  
##  1st Qu.:0.1667         1st Qu.:0.0000               1st Qu.:   946  
##  Median :0.5000         Median :0.0000               Median :  1400  
##  Mean   :0.3419         Mean   :0.1561               Mean   :  3395  
##  3rd Qu.:0.5000         3rd Qu.:0.2500               3rd Qu.:  2800  
##  Max.   :0.5000         Max.   :1.0000               Max.   :843300  
##                                                                      
##  popular  
##  0:20082  
##  1:19561  
##           
##           
##           
##           
## 

5. Data Set Split

Before building the model, the data were split 50:50 into a training set and a test set.

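# Data Set Split
n <- nrow(news)
set.seed(181215)                      # fix the seed so the split is reproducible
# n is odd, so take floor(n / 2) rows for training; the rest go to the test set
index.train <- sort(sample(1:n, floor(n / 2), replace = FALSE))
news.train <- news[index.train, ]
news.test <- news[-index.train, ]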
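
Because the number of rows is odd, the two halves are not exactly equal in size. The following is a minimal sketch that reproduces the index logic on row numbers only. It assumes n = 39643, the row count implied by the popular tally in the summary above (20082 + 19561), and checks that the two index sets are disjoint and cover every row.

```r
# Sketch: reproduce the split logic on row indices only (no data needed).
# Assumption: n = 39643, taken from the popular counts 20082 + 19561.
n <- 20082 + 19561
set.seed(181215)
index.train <- sort(sample(1:n, floor(n / 2), replace = FALSE))
index.test <- setdiff(1:n, index.train)

n.train <- length(index.train)  # floor(39643 / 2)
n.test <- length(index.test)    # the remaining rows

# The two sets are disjoint and together cover every row.
stopifnot(length(intersect(index.train, index.test)) == 0)
stopifnot(n.train + n.test == n)
cat("train:", n.train, " test:", n.test, "\n")
```

Under this assumption the training set has 19821 rows and the test set has 19822, which agrees with the totals of the two confusion matrices reported in the next section.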

6. Building the Random Forest Model

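In this analysis we use Random Forest, a method that combines many decision trees into a single model. Random Forest mitigates the shortcomings of a single decision tree and tends to predict well. Because this dataset has a fairly large number of variables, we judged Random Forest to be a suitable choice. We fit a Random Forest model that classifies popular, using every variable except shares and url (shares is excluded because popularity is judged from it, and url is a non-predictive identifier).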

# Build the Random Forest model
library(randomForest)                 # provides randomForest(), importance(), varImpPlot()
set.seed(181215)
news.rf1 <- randomForest(popular ~ . - shares - url, data = news.train)
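
When mtry is not supplied, randomForest() tries floor(sqrt(p)) randomly chosen predictors at each split for classification, where p is the number of predictors. The sketch below shows only that arithmetic. It assumes p = 59, the number of variables that appear in this model's importance table.

```r
# Sketch: default number of candidate variables per split for classification.
# Assumption: p = 59 predictors remain after dropping shares and url.
p <- 59
mtry.default <- floor(sqrt(p))
print(mtry.default)
```

The result matches the "No. of variables tried at each split" line in the model printout. A different value can be passed through the mtry argument.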

Next, we inspect the fitted model and its variable importance (the degree to which each variable contributes to the model's explanatory power).

# Inspect the fitted model
news.rf1
## 
## Call:
##  randomForest(formula = popular ~ . - shares - url, data = news.train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 33.76%
## Confusion matrix:
##      0    1 class.error
## 0 6742 3350   0.3319461
## 1 3342 6387   0.3435091
importance(news.rf1)
##                               MeanDecreaseGini
## timedelta                            290.84410
## n_tokens_title                       149.60065
## n_tokens_content                     229.15671
## n_unique_tokens                      253.33979
## n_non_stop_words                     226.58606
## n_non_stop_unique_tokens             257.66839
## num_hrefs                            196.92186
## num_self_hrefs                       126.52540
## num_imgs                             137.01156
## num_videos                            76.12201
## average_token_length                 255.23249
## num_keywords                          99.28233
## data_channel_is_lifestyle             14.35835
## data_channel_is_entertainment         79.33210
## data_channel_is_bus                   18.76879
## data_channel_is_socmed                39.86431
## data_channel_is_tech                  46.44902
## data_channel_is_world                 64.40557
## kw_min_min                            46.08973
## kw_max_min                           244.64755
## kw_avg_min                           272.39033
## kw_min_max                           164.88279
## kw_max_max                            57.20365
## kw_avg_max                           277.41767
## kw_min_avg                           233.48924
## kw_max_avg                           373.38827
## kw_avg_avg                           418.31102
## self_reference_min_shares            283.21360
## self_reference_max_shares            218.30834
## self_reference_avg_sharess           260.54455
## weekday_is_monday                     23.93554
## weekday_is_tuesday                    25.59525
## weekday_is_wednesday                  28.65229
## weekday_is_thursday                   24.34554
## weekday_is_friday                     23.34009
## weekday_is_saturday                   35.58614
## weekday_is_sunday                     24.64007
## is_weekend                            88.45346
## LDA_00                               274.86531
## LDA_01                               280.01828
## LDA_02                               328.64412
## LDA_03                               255.89929
## LDA_04                               279.41050
## global_subjectivity                  261.68133
## global_sentiment_polarity            233.11521
## global_rate_positive_words           240.84038
## global_rate_negative_words           220.01637
## rate_positive_words                  193.89371
## rate_negative_words                  193.41509
## avg_positive_polarity                240.73241
## min_positive_polarity                120.64513
## max_positive_polarity                 97.92928
## avg_negative_polarity                225.78484
## min_negative_polarity                137.89751
## max_negative_polarity                133.60407
## title_subjectivity                   123.93128
## title_sentiment_polarity             143.99882
## abs_title_subjectivity               114.95865
## abs_title_sentiment_polarity         115.61161
# Plot the variable importance
varImpPlot(news.rf1)
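
The raw importance table is easier to read when sorted. The sketch below copies a handful of MeanDecreaseGini values from the output above into a named vector (an illustrative subset, not the full table) and orders them. It also recomputes the OOB error rate from the printed training confusion matrix. The object names gini, oob.cm and oob.error are introduced here for illustration only.

```r
# Sketch: rank a hand-copied subset of MeanDecreaseGini values.
gini <- c(kw_avg_avg = 418.31102, kw_max_avg = 373.38827,
          LDA_02 = 328.64412, timedelta = 290.84410,
          self_reference_min_shares = 283.21360,
          weekday_is_friday = 23.34009,
          data_channel_is_lifestyle = 14.35835)
gini.sorted <- sort(gini, decreasing = TRUE)
print(head(gini.sorted, 3))

# With the fitted model available, the full table could be ordered the same way:
#   imp <- importance(news.rf1)
#   imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]

# OOB error = off-diagonal count / total, from the printed confusion matrix.
# Rows are the actual classes, columns the OOB predictions.
oob.cm <- matrix(c(6742, 3342, 3350, 6387), nrow = 2,
                 dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))
oob.error <- (oob.cm["0", "1"] + oob.cm["1", "0"]) / sum(oob.cm)
print(round(100 * oob.error, 2))
```

In the full table the keyword-share variables kw_avg_avg and kw_max_avg rank highest, followed by LDA_02, while the weekday and data channel indicators rank lowest.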

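We then validated the model fitted on news.train by applying it to news.test. Of the 19,822 observations in the test set, 6,534 were misclassified (33.0%), which corresponds to the reported accuracy of 0.6704.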

# Apply the model to the test set
library(caret)                        # provides confusionMatrix()
rfpred <- predict(news.rf1, newdata = news.test)
confusionMatrix(rfpred, news.test$popular)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6667 3211
##          1 3323 6621
##                                           
##                Accuracy : 0.6704          
##                  95% CI : (0.6638, 0.6769)
##     No Information Rate : 0.504           
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.3408          
##  Mcnemar's Test P-Value : 0.1697          
##                                           
##             Sensitivity : 0.6674          
##             Specificity : 0.6734          
##          Pos Pred Value : 0.6749          
##          Neg Pred Value : 0.6658          
##              Prevalence : 0.5040          
##          Detection Rate : 0.3363          
##    Detection Prevalence : 0.4983          
##       Balanced Accuracy : 0.6704          
##                                           
##        'Positive' Class : 0               
## 
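
The headline statistics can be recomputed directly from the four cells of the test confusion matrix. Note that class 0 (not popular) is treated as the positive class in the output above, so Sensitivity describes how well unpopular articles are detected. The following is a minimal sketch in base R that uses the printed counts. The object names are introduced here for illustration only.

```r
# Sketch: recompute the reported statistics from the printed test confusion matrix.
# Rows are predictions, columns are the reference (true) classes.
cm <- matrix(c(6667, 3323, 3211, 6621), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

n.test <- sum(cm)
n.wrong <- cm["0", "1"] + cm["1", "0"]
accuracy <- sum(diag(cm)) / n.test

# With class 0 as the positive class (as in the output above):
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # true 0s predicted as 0
specificity <- cm["1", "1"] / sum(cm[, "1"])  # true 1s predicted as 1

print(c(n = n.test, misclassified = n.wrong))
print(round(c(accuracy = accuracy, error = 1 - accuracy,
              sensitivity = sensitivity, specificity = specificity), 4))
```

To report the metrics with popular articles as the positive class instead, confusionMatrix() accepts a positive argument, for example positive = "1".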

7. Reference

The references are as follows.