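As digital technology has advanced, news organizations that were traditionally built on media such as newspapers and broadcasting have moved into online publishing. Most news content is now distributed through online and mobile platforms, and the number of articles published has grown sharply. The major portal sites 'Naver' and 'Daum' deliver an average of 50,000 articles a day to readers, which comes to roughly 18 million a year. Among so many articles, which ones are 'popular'? For a news organization, this question matters a great deal when setting editorial direction and when securing the advertising revenue needed for stable management. This report builds a model that classifies whether an online news article is popular.
The data consist of all articles published on the news site Mashable (www.mashable.com) over two years (2013-01-07 to 2015-01-07). The dataset is described as having 39,797 observations and 61 variables; the CSV file loaded below contains 39,644 rows. It can be downloaded from the following site.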
https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity
An article is considered more popular the more it is shared (shares). The variables are described in detail below.
| No. | Variable | Description |
|---|---|---|
| 0 | url | URL of the article (non-predictive) |
| 1 | timedelta | Days between the article publication and the dataset acquisition (non-predictive) |
| 2 | n_tokens_title | Number of words in the title |
| 3 | n_tokens_content | Number of words in the content |
| 4 | n_unique_tokens | Rate of unique words in the content |
| 5 | n_non_stop_words | Rate of non-stop words in the content |
| 6 | n_non_stop_unique_tokens | Rate of unique non-stop words in the content |
| 7 | num_hrefs | Number of links |
| 8 | num_self_hrefs | Number of links to other articles published by Mashable |
| 9 | num_imgs | Number of images |
| 10 | num_videos | Number of videos |
| 11 | average_token_length | Average length of the words in the content |
| 12 | num_keywords | Number of keywords in the metadata |
| 13 | data_channel_is_lifestyle | Is data channel ‘Lifestyle’? |
| 14 | data_channel_is_entertainment | Is data channel ‘Entertainment’? |
| 15 | data_channel_is_bus | Is data channel ‘Business’? |
| 16 | data_channel_is_socmed | Is data channel ‘Social Media’? |
| 17 | data_channel_is_tech | Is data channel ‘Tech’? |
| 18 | data_channel_is_world | Is data channel ‘World’? |
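| 19 | kw_min_min | Worst keyword (min. shares) |
| 20 | kw_max_min | Worst keyword (max. shares) |
| 21 | kw_avg_min | Worst keyword (avg. shares) |
| 22 | kw_min_max | Best keyword (min. shares) |
| 23 | kw_max_max | Best keyword (max. shares) |
| 24 | kw_avg_max | Best keyword (avg. shares) |
| 25 | kw_min_avg | Avg. keyword (min. shares) |
| 26 | kw_max_avg | Avg. keyword (max. shares) |
| 27 | kw_avg_avg | Avg. keyword (avg. shares) |
| 28 | self_reference_min_shares | Min. shares of referenced articles in Mashable |
| 29 | self_reference_max_shares | Max. shares of referenced articles in Mashable |
| 30 | self_reference_avg_sharess | Avg. shares of referenced articles in Mashable |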
| 31 | weekday_is_monday | Was the article published on a Monday? |
| 32 | weekday_is_tuesday | Was the article published on a Tuesday? |
| 33 | weekday_is_wednesday | Was the article published on a Wednesday? |
| 34 | weekday_is_thursday | Was the article published on a Thursday? |
| 35 | weekday_is_friday | Was the article published on a Friday? |
| 36 | weekday_is_saturday | Was the article published on a Saturday? |
| 37 | weekday_is_sunday | Was the article published on a Sunday? |
| 38 | is_weekend | Was the article published on the weekend? |
| 39 | LDA_00 | Closeness to LDA topic 0 |
| 40 | LDA_01 | Closeness to LDA topic 1 |
| 41 | LDA_02 | Closeness to LDA topic 2 |
| 42 | LDA_03 | Closeness to LDA topic 3 |
| 43 | LDA_04 | Closeness to LDA topic 4 |
| 44 | global_subjectivity | Text subjectivity |
| 45 | global_sentiment_polarity | Text sentiment polarity |
| 46 | global_rate_positive_words | Rate of positive words in the content |
| 47 | global_rate_negative_words | Rate of negative words in the content |
| 48 | rate_positive_words | Rate of positive words among non-neutral tokens |
| 49 | rate_negative_words | Rate of negative words among non-neutral tokens |
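| 50 | avg_positive_polarity | Avg. polarity of positive words |
| 51 | min_positive_polarity | Min. polarity of positive words |
| 52 | max_positive_polarity | Max. polarity of positive words |
| 53 | avg_negative_polarity | Avg. polarity of negative words |
| 54 | min_negative_polarity | Min. polarity of negative words |
| 55 | max_negative_polarity | Max. polarity of negative words |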
| 56 | title_subjectivity | Title subjectivity |
| 57 | title_sentiment_polarity | Title polarity |
| 58 | abs_title_subjectivity | Absolute subjectivity level |
| 59 | abs_title_sentiment_polarity | Absolute polarity level |
| 60 | shares | Number of shares (target) |
# Check the working directory
getwd()
## [1] "C:/Users/e/Documents/R"
# Load the data
news <- read.csv("OnlineNewsPopularity.csv",
                 header = TRUE)
# Install packages
install.packages("dplyr", repos ="http://cran.us.r-project.org")
## Installing package into 'C:/Users/e/Documents/R/win-library/3.5'
## (as 'lib' is unspecified)
## package 'dplyr' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\e\AppData\Local\Temp\RtmpKkxJBc\downloaded_packages
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
install.packages("randomForest", repos = "http://cran.us.r-project.org", dependencies = TRUE)
## Installing package into 'C:/Users/e/Documents/R/win-library/3.5'
## (as 'lib' is unspecified)
## package 'randomForest' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\e\AppData\Local\Temp\RtmpKkxJBc\downloaded_packages
library("randomForest")
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
install.packages("caret", repos = "http://cran.us.r-project.org", dependencies = TRUE)
## Installing package into 'C:/Users/e/Documents/R/win-library/3.5'
## (as 'lib' is unspecified)
## package 'caret' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\e\AppData\Local\Temp\RtmpKkxJBc\downloaded_packages
library("caret")
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
# Inspect the data
head(news)
## url timedelta
## 1 http://mashable.com/2013/01/07/amazon-instant-video-browser/ 731
## 2 http://mashable.com/2013/01/07/ap-samsung-sponsored-tweets/ 731
## 3 http://mashable.com/2013/01/07/apple-40-billion-app-downloads/ 731
## 4 http://mashable.com/2013/01/07/astronaut-notre-dame-bcs/ 731
## 5 http://mashable.com/2013/01/07/att-u-verse-apps/ 731
## 6 http://mashable.com/2013/01/07/beewi-smart-toys/ 731
## n_tokens_title n_tokens_content n_unique_tokens n_non_stop_words
## 1 12 219 0.6635945 1
## 2 9 255 0.6047431 1
## 3 9 211 0.5751295 1
## 4 9 531 0.5037879 1
## 5 13 1072 0.4156456 1
## 6 10 370 0.5598886 1
## n_non_stop_unique_tokens num_hrefs num_self_hrefs num_imgs num_videos
## 1 0.8153846 4 2 1 0
## 2 0.7919463 3 1 1 0
## 3 0.6638655 3 1 1 0
## 4 0.6656347 9 0 1 0
## 5 0.5408895 19 19 20 0
## 6 0.6981982 2 2 0 0
## average_token_length num_keywords data_channel_is_lifestyle
## 1 4.680365 5 0
## 2 4.913725 4 0
## 3 4.393365 6 0
## 4 4.404896 7 0
## 5 4.682836 7 0
## 6 4.359459 9 0
## data_channel_is_entertainment data_channel_is_bus data_channel_is_socmed
## 1 1 0 0
## 2 0 1 0
## 3 0 1 0
## 4 1 0 0
## 5 0 0 0
## 6 0 0 0
## data_channel_is_tech data_channel_is_world kw_min_min kw_max_min
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 1 0 0 0
## 6 1 0 0 0
## kw_avg_min kw_min_max kw_max_max kw_avg_max kw_min_avg kw_max_avg
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## kw_avg_avg self_reference_min_shares self_reference_max_shares
## 1 0 496 496
## 2 0 0 0
## 3 0 918 918
## 4 0 0 0
## 5 0 545 16000
## 6 0 8500 8500
## self_reference_avg_sharess weekday_is_monday weekday_is_tuesday
## 1 496.000 1 0
## 2 0.000 1 0
## 3 918.000 1 0
## 4 0.000 1 0
## 5 3151.158 1 0
## 6 8500.000 1 0
## weekday_is_wednesday weekday_is_thursday weekday_is_friday
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## weekday_is_saturday weekday_is_sunday is_weekend LDA_00 LDA_01
## 1 0 0 0 0.50033120 0.37827893
## 2 0 0 0 0.79975569 0.05004668
## 3 0 0 0 0.21779229 0.03333446
## 4 0 0 0 0.02857322 0.41929964
## 5 0 0 0 0.02863281 0.02879355
## 6 0 0 0 0.02224528 0.30671758
## LDA_02 LDA_03 LDA_04 global_subjectivity
## 1 0.04000468 0.04126265 0.04012254 0.5216171
## 2 0.05009625 0.05010067 0.05000071 0.3412458
## 3 0.03335142 0.03333354 0.68218829 0.7022222
## 4 0.49465083 0.02890472 0.02857160 0.4298497
## 5 0.02857518 0.02857168 0.88542678 0.5135021
## 6 0.02223128 0.02222429 0.62658158 0.4374086
## global_sentiment_polarity global_rate_positive_words
## 1 0.09256198 0.04566210
## 2 0.14894781 0.04313725
## 3 0.32333333 0.05687204
## 4 0.10070467 0.04143126
## 5 0.28100348 0.07462687
## 6 0.07118419 0.02972973
## global_rate_negative_words rate_positive_words rate_negative_words
## 1 0.013698630 0.7692308 0.2307692
## 2 0.015686275 0.7333333 0.2666667
## 3 0.009478673 0.8571429 0.1428571
## 4 0.020715631 0.6666667 0.3333333
## 5 0.012126866 0.8602151 0.1397849
## 6 0.027027027 0.5238095 0.4761905
## avg_positive_polarity min_positive_polarity max_positive_polarity
## 1 0.3786364 0.10000000 0.7
## 2 0.2869146 0.03333333 0.7
## 3 0.4958333 0.10000000 1.0
## 4 0.3859652 0.13636364 0.8
## 5 0.4111274 0.03333333 1.0
## 6 0.3506100 0.13636364 0.6
## avg_negative_polarity min_negative_polarity max_negative_polarity
## 1 -0.3500000 -0.600 -0.2000000
## 2 -0.1187500 -0.125 -0.1000000
## 3 -0.4666667 -0.800 -0.1333333
## 4 -0.3696970 -0.600 -0.1666667
## 5 -0.2201923 -0.500 -0.0500000
## 6 -0.1950000 -0.400 -0.1000000
## title_subjectivity title_sentiment_polarity abs_title_subjectivity
## 1 0.5000000 -0.1875000 0.00000000
## 2 0.0000000 0.0000000 0.50000000
## 3 0.0000000 0.0000000 0.50000000
## 4 0.0000000 0.0000000 0.50000000
## 5 0.4545455 0.1363636 0.04545455
## 6 0.6428571 0.2142857 0.14285714
## abs_title_sentiment_polarity shares
## 1 0.1875000 593
## 2 0.0000000 711
## 3 0.0000000 1500
## 4 0.0000000 1200
## 5 0.1363636 505
## 6 0.2142857 855
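The data were explored with summary(). The maximum values reveal outliers: n_unique_tokens and n_non_stop_words are rates and should lie between 0 and 1, yet their maxima are 701 and 1042 respectively (n_non_stop_unique_tokens likewise has a maximum of 650). These values need to be handled during preprocessing.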
summary(news)
## url
## http://mashable.com/2013/01/07/amazon-instant-video-browser/ : 1
## http://mashable.com/2013/01/07/ap-samsung-sponsored-tweets/ : 1
## http://mashable.com/2013/01/07/apple-40-billion-app-downloads/: 1
## http://mashable.com/2013/01/07/astronaut-notre-dame-bcs/ : 1
## http://mashable.com/2013/01/07/att-u-verse-apps/ : 1
## http://mashable.com/2013/01/07/beewi-smart-toys/ : 1
## (Other) :39638
## timedelta n_tokens_title n_tokens_content n_unique_tokens
## Min. : 8.0 Min. : 2.0 Min. : 0.0 Min. : 0.0000
## 1st Qu.:164.0 1st Qu.: 9.0 1st Qu.: 246.0 1st Qu.: 0.4709
## Median :339.0 Median :10.0 Median : 409.0 Median : 0.5392
## Mean :354.5 Mean :10.4 Mean : 546.5 Mean : 0.5482
## 3rd Qu.:542.0 3rd Qu.:12.0 3rd Qu.: 716.0 3rd Qu.: 0.6087
## Max. :731.0 Max. :23.0 Max. :8474.0 Max. :701.0000
##
## n_non_stop_words n_non_stop_unique_tokens num_hrefs
## Min. : 0.0000 Min. : 0.0000 Min. : 0.00
## 1st Qu.: 1.0000 1st Qu.: 0.6257 1st Qu.: 4.00
## Median : 1.0000 Median : 0.6905 Median : 8.00
## Mean : 0.9965 Mean : 0.6892 Mean : 10.88
## 3rd Qu.: 1.0000 3rd Qu.: 0.7546 3rd Qu.: 14.00
## Max. :1042.0000 Max. :650.0000 Max. :304.00
##
## num_self_hrefs num_imgs num_videos average_token_length
## Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. :0.000
## 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 0.00 1st Qu.:4.478
## Median : 3.000 Median : 1.000 Median : 0.00 Median :4.664
## Mean : 3.294 Mean : 4.544 Mean : 1.25 Mean :4.548
## 3rd Qu.: 4.000 3rd Qu.: 4.000 3rd Qu.: 1.00 3rd Qu.:4.855
## Max. :116.000 Max. :128.000 Max. :91.00 Max. :8.042
##
## num_keywords data_channel_is_lifestyle data_channel_is_entertainment
## Min. : 1.000 Min. :0.00000 Min. :0.000
## 1st Qu.: 6.000 1st Qu.:0.00000 1st Qu.:0.000
## Median : 7.000 Median :0.00000 Median :0.000
## Mean : 7.224 Mean :0.05295 Mean :0.178
## 3rd Qu.: 9.000 3rd Qu.:0.00000 3rd Qu.:0.000
## Max. :10.000 Max. :1.00000 Max. :1.000
##
## data_channel_is_bus data_channel_is_socmed data_channel_is_tech
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.1579 Mean :0.0586 Mean :0.1853
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## data_channel_is_world kw_min_min kw_max_min kw_avg_min
## Min. :0.0000 Min. : -1.00 Min. : 0 Min. : -1.0
## 1st Qu.:0.0000 1st Qu.: -1.00 1st Qu.: 445 1st Qu.: 141.8
## Median :0.0000 Median : -1.00 Median : 660 Median : 235.5
## Mean :0.2126 Mean : 26.11 Mean : 1154 Mean : 312.4
## 3rd Qu.:0.0000 3rd Qu.: 4.00 3rd Qu.: 1000 3rd Qu.: 357.0
## Max. :1.0000 Max. :377.00 Max. :298400 Max. :42827.9
##
## kw_min_max kw_max_max kw_avg_max kw_min_avg
## Min. : 0 Min. : 0 Min. : 0 Min. : -1
## 1st Qu.: 0 1st Qu.:843300 1st Qu.:172847 1st Qu.: 0
## Median : 1400 Median :843300 Median :244572 Median :1024
## Mean : 13612 Mean :752324 Mean :259282 Mean :1117
## 3rd Qu.: 7900 3rd Qu.:843300 3rd Qu.:330980 3rd Qu.:2057
## Max. :843300 Max. :843300 Max. :843300 Max. :3613
##
## kw_max_avg kw_avg_avg self_reference_min_shares
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 3562 1st Qu.: 2382 1st Qu.: 639
## Median : 4356 Median : 2870 Median : 1200
## Mean : 5657 Mean : 3136 Mean : 3999
## 3rd Qu.: 6020 3rd Qu.: 3600 3rd Qu.: 2600
## Max. :298400 Max. :43568 Max. :843300
##
## self_reference_max_shares self_reference_avg_sharess weekday_is_monday
## Min. : 0 Min. : 0.0 Min. :0.000
## 1st Qu.: 1100 1st Qu.: 981.2 1st Qu.:0.000
## Median : 2800 Median : 2200.0 Median :0.000
## Mean : 10329 Mean : 6401.7 Mean :0.168
## 3rd Qu.: 8000 3rd Qu.: 5200.0 3rd Qu.:0.000
## Max. :843300 Max. :843300.0 Max. :1.000
##
## weekday_is_tuesday weekday_is_wednesday weekday_is_thursday
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.1864 Mean :0.1875 Mean :0.1833
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## weekday_is_friday weekday_is_saturday weekday_is_sunday is_weekend
## Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :0.00000 Median :0.0000
## Mean :0.1438 Mean :0.06188 Mean :0.06904 Mean :0.1309
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.0000
##
## LDA_00 LDA_01 LDA_02 LDA_03
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.02505 1st Qu.:0.02501 1st Qu.:0.02857 1st Qu.:0.02857
## Median :0.03339 Median :0.03334 Median :0.04000 Median :0.04000
## Mean :0.18460 Mean :0.14126 Mean :0.21632 Mean :0.22377
## 3rd Qu.:0.24096 3rd Qu.:0.15083 3rd Qu.:0.33422 3rd Qu.:0.37576
## Max. :0.92699 Max. :0.92595 Max. :0.92000 Max. :0.92653
##
## LDA_04 global_subjectivity global_sentiment_polarity
## Min. :0.00000 Min. :0.0000 Min. :-0.39375
## 1st Qu.:0.02857 1st Qu.:0.3962 1st Qu.: 0.05776
## Median :0.04073 Median :0.4535 Median : 0.11912
## Mean :0.23403 Mean :0.4434 Mean : 0.11931
## 3rd Qu.:0.39999 3rd Qu.:0.5083 3rd Qu.: 0.17783
## Max. :0.92719 Max. :1.0000 Max. : 0.72784
##
## global_rate_positive_words global_rate_negative_words rate_positive_words
## Min. :0.00000 Min. :0.000000 Min. :0.0000
## 1st Qu.:0.02838 1st Qu.:0.009615 1st Qu.:0.6000
## Median :0.03902 Median :0.015337 Median :0.7105
## Mean :0.03962 Mean :0.016612 Mean :0.6822
## 3rd Qu.:0.05028 3rd Qu.:0.021739 3rd Qu.:0.8000
## Max. :0.15549 Max. :0.184932 Max. :1.0000
##
## rate_negative_words avg_positive_polarity min_positive_polarity
## Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.1852 1st Qu.:0.3062 1st Qu.:0.05000
## Median :0.2800 Median :0.3588 Median :0.10000
## Mean :0.2879 Mean :0.3538 Mean :0.09545
## 3rd Qu.:0.3846 3rd Qu.:0.4114 3rd Qu.:0.10000
## Max. :1.0000 Max. :1.0000 Max. :1.00000
##
## max_positive_polarity avg_negative_polarity min_negative_polarity
## Min. :0.0000 Min. :-1.0000 Min. :-1.0000
## 1st Qu.:0.6000 1st Qu.:-0.3284 1st Qu.:-0.7000
## Median :0.8000 Median :-0.2533 Median :-0.5000
## Mean :0.7567 Mean :-0.2595 Mean :-0.5219
## 3rd Qu.:1.0000 3rd Qu.:-0.1869 3rd Qu.:-0.3000
## Max. :1.0000 Max. : 0.0000 Max. : 0.0000
##
## max_negative_polarity title_subjectivity title_sentiment_polarity
## Min. :-1.0000 Min. :0.0000 Min. :-1.00000
## 1st Qu.:-0.1250 1st Qu.:0.0000 1st Qu.: 0.00000
## Median :-0.1000 Median :0.1500 Median : 0.00000
## Mean :-0.1075 Mean :0.2824 Mean : 0.07143
## 3rd Qu.:-0.0500 3rd Qu.:0.5000 3rd Qu.: 0.15000
## Max. : 0.0000 Max. :1.0000 Max. : 1.00000
##
## abs_title_subjectivity abs_title_sentiment_polarity shares
## Min. :0.0000 Min. :0.0000 Min. : 1
## 1st Qu.:0.1667 1st Qu.:0.0000 1st Qu.: 946
## Median :0.5000 Median :0.0000 Median : 1400
## Mean :0.3418 Mean :0.1561 Mean : 3395
## 3rd Qu.:0.5000 3rd Qu.:0.2500 3rd Qu.: 2800
## Max. :0.5000 Max. :1.0000 Max. :843300
##
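Reading a long summary by eye makes it easy to miss such values. The following is a minimal sketch of a programmatic range check, run on a small made-up data frame rather than on the real data; `toy`, `rate_cols`, `out_of_range`, and `bad_rows` are hypothetical names, and the choice of which columns count as rates is an assumption based on the variable descriptions.

```r
# Toy data standing in for `news`; the values are made up for illustration.
toy <- data.frame(
  n_unique_tokens  = c(0.66, 0.60, 701),
  n_non_stop_words = c(1, 1, 1042),
  num_hrefs        = c(4, 3, 11)
)

# Columns described as rates, which should therefore lie in [0, 1].
rate_cols <- c("n_unique_tokens", "n_non_stop_words")

# For each rate column, flag whether any value falls outside [0, 1].
out_of_range <- vapply(
  toy[rate_cols],
  function(x) any(x < 0 | x > 1, na.rm = TRUE),
  logical(1)
)
print(out_of_range)

# Row positions with at least one out-of-range rate.
bad_rows <- which(rowSums(toy[rate_cols] < 0 | toy[rate_cols] > 1) > 0)
print(bad_rows)
```

Applied to the real data frame, the same two steps would point to the columns and rows that need attention before modelling.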
To locate the outlier found earlier, cases with n_unique_tokens greater than 1 were extracted. Only one case was found, row 31038, and that row was removed from the data.
# Preprocessing: remove the outlier
subset(news, n_unique_tokens>1)
## url
## 31038 http://mashable.com/2014/08/18/ukraine-civilian-convoy-attacked/
## timedelta n_tokens_title n_tokens_content n_unique_tokens
## 31038 142 9 1570 701
## n_non_stop_words n_non_stop_unique_tokens num_hrefs num_self_hrefs
## 31038 1042 650 11 10
## num_imgs num_videos average_token_length num_keywords
## 31038 51 0 4.696178 7
## data_channel_is_lifestyle data_channel_is_entertainment
## 31038 0 1
## data_channel_is_bus data_channel_is_socmed data_channel_is_tech
## 31038 0 0 0
## data_channel_is_world kw_min_min kw_max_min kw_avg_min kw_min_max
## 31038 0 -1 778 143.7143 23100
## kw_max_max kw_avg_max kw_min_avg kw_max_avg kw_avg_avg
## 31038 843300 330442.9 2420.579 3490.599 2912.105
## self_reference_min_shares self_reference_max_shares
## 31038 795 0
## self_reference_avg_sharess weekday_is_monday weekday_is_tuesday
## 31038 6924.375 0 1
## weekday_is_wednesday weekday_is_thursday weekday_is_friday
## 31038 0 0 0
## weekday_is_saturday weekday_is_sunday is_weekend LDA_00 LDA_01
## 31038 0 0 0 0 0
## LDA_02 LDA_03 LDA_04 global_subjectivity global_sentiment_polarity
## 31038 0 0 0 0 0
## global_rate_positive_words global_rate_negative_words
## 31038 0 0
## rate_positive_words rate_negative_words avg_positive_polarity
## 31038 0 0 0
## min_positive_polarity max_positive_polarity avg_negative_polarity
## 31038 0 0 0
## min_negative_polarity max_negative_polarity title_subjectivity
## 31038 0 0 0
## title_sentiment_polarity abs_title_subjectivity
## 31038 0 0
## abs_title_sentiment_polarity shares
## 31038 0 5900
news <- news[-31038,]
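Removing a row by its position works here, but it depends on the row order of the file. As an alternative sketch (not the approach used in this report), the same row can be dropped by condition; `toy` and `toy_clean` are hypothetical names and the values are made up.

```r
# Toy data with one impossible rate value in the second row.
toy <- data.frame(
  id              = 1:4,
  n_unique_tokens = c(0.66, 701, 0.58, 0.50)
)

# Keep only rows whose rate is at most 1, without hard-coding a row number.
# (This toy example has no missing values; NA handling is left out.)
toy_clean <- toy[toy$n_unique_tokens <= 1, ]
print(toy_clean)
```

The filter keeps working if the file is re-downloaded or re-sorted, because it does not rely on the index 31038.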
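The aim of this analysis is not to predict the number of shares but to classify whether an article is popular. The shares variable was therefore converted into a popular variable: articles with more shares than the median of shares are coded 1 (popular), and those with the same number or fewer are coded 0 (unpopular).
# Preprocessing: variable transformation (shares -> popular)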
news$popular <- factor(as.numeric(news$shares>median(news$shares)),levels=c(0,1))
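To make the coding rule concrete, here is a small sketch that applies the same expression to the six shares values shown in the head() output, and then to a made-up vector with ties at the median. The names `shares6`, `shares_ties`, `popular6`, and `popular_ties` are hypothetical.

```r
# The six shares values from the first rows of the data.
shares6 <- c(593, 711, 1500, 1200, 505, 855)
med6 <- median(shares6)  # average of the two middle values, 711 and 855
popular6 <- factor(as.numeric(shares6 > med6), levels = c(0, 1))
print(table(popular6))

# Made-up values with ties at the median: ties are coded 0 because the
# comparison is strict (>), so the two classes need not be the same size.
shares_ties <- c(1, 2, 2, 2, 3)
popular_ties <- factor(as.numeric(shares_ties > median(shares_ties)), levels = c(0, 1))
print(table(popular_ties))
```

Because values equal to the median fall into class 0, the two classes can be slightly unbalanced, which is consistent with the class counts in the summary that follows (20082 versus 19561).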
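The dataset contains nominal variables. The six variables beginning with 'data_channel_is' indicate the channel in which the article was published. The seven variables beginning with 'weekday_is_', together with 'is_weekend', make eight variables that indicate the day of publication. These 14 variables were therefore converted to factors.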
# change 'numeric' into 'factor'
news$data_channel_is_lifestyle <- as.factor(news$data_channel_is_lifestyle)
news$data_channel_is_entertainment <- as.factor(news$data_channel_is_entertainment)
news$data_channel_is_bus <- as.factor(news$data_channel_is_bus)
news$data_channel_is_socmed <- as.factor(news$data_channel_is_socmed)
news$data_channel_is_tech <- as.factor(news$data_channel_is_tech)
news$data_channel_is_world <- as.factor(news$data_channel_is_world)
news$weekday_is_monday <- as.factor(news$weekday_is_monday)
news$weekday_is_tuesday <- as.factor(news$weekday_is_tuesday)
news$weekday_is_wednesday <- as.factor(news$weekday_is_wednesday)
news$weekday_is_thursday <- as.factor(news$weekday_is_thursday)
news$weekday_is_friday <- as.factor(news$weekday_is_friday)
news$weekday_is_saturday <- as.factor(news$weekday_is_saturday)
news$weekday_is_sunday <- as.factor(news$weekday_is_sunday)
news$is_weekend <- as.factor(news$is_weekend)
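The fourteen assignments above can also be written as a single step that selects the indicator columns by name. This is only a sketch on a made-up data frame; `toy` and `flag_cols` are hypothetical names, and the regular expression assumes the column naming shown in the variable table.

```r
# Toy data with three indicator columns and one count column.
toy <- data.frame(
  data_channel_is_tech = c(0, 1, 0),
  weekday_is_monday    = c(1, 0, 0),
  is_weekend           = c(0, 0, 1),
  num_imgs             = c(1, 20, 0)
)

# Select the indicator columns by their name pattern.
flag_cols <- grep("^(data_channel_is_|weekday_is_)|^is_weekend$",
                  names(toy), value = TRUE)

# Convert all selected columns to factors at once.
toy[flag_cols] <- lapply(toy[flag_cols], as.factor)
str(toy)
```

Selecting by pattern avoids typing each column name twice, which is where typos tend to slip in.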
# Re-check the data
summary(news)
## url
## http://mashable.com/2013/01/07/amazon-instant-video-browser/ : 1
## http://mashable.com/2013/01/07/ap-samsung-sponsored-tweets/ : 1
## http://mashable.com/2013/01/07/apple-40-billion-app-downloads/: 1
## http://mashable.com/2013/01/07/astronaut-notre-dame-bcs/ : 1
## http://mashable.com/2013/01/07/att-u-verse-apps/ : 1
## http://mashable.com/2013/01/07/beewi-smart-toys/ : 1
## (Other) :39637
## timedelta n_tokens_title n_tokens_content n_unique_tokens
## Min. : 8.0 Min. : 2.0 Min. : 0.0 Min. :0.0000
## 1st Qu.:164.0 1st Qu.: 9.0 1st Qu.: 246.0 1st Qu.:0.4709
## Median :339.0 Median :10.0 Median : 409.0 Median :0.5392
## Mean :354.5 Mean :10.4 Mean : 546.5 Mean :0.5305
## 3rd Qu.:542.0 3rd Qu.:12.0 3rd Qu.: 716.0 3rd Qu.:0.6087
## Max. :731.0 Max. :23.0 Max. :8474.0 Max. :1.0000
##
## n_non_stop_words n_non_stop_unique_tokens num_hrefs
## Min. :0.0000 Min. :0.0000 Min. : 0.00
## 1st Qu.:1.0000 1st Qu.:0.6257 1st Qu.: 4.00
## Median :1.0000 Median :0.6905 Median : 8.00
## Mean :0.9702 Mean :0.6728 Mean : 10.88
## 3rd Qu.:1.0000 3rd Qu.:0.7546 3rd Qu.: 14.00
## Max. :1.0000 Max. :1.0000 Max. :304.00
##
## num_self_hrefs num_imgs num_videos average_token_length
## Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. :0.000
## 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 0.00 1st Qu.:4.478
## Median : 3.000 Median : 1.000 Median : 0.00 Median :4.664
## Mean : 3.293 Mean : 4.543 Mean : 1.25 Mean :4.548
## 3rd Qu.: 4.000 3rd Qu.: 4.000 3rd Qu.: 1.00 3rd Qu.:4.855
## Max. :116.000 Max. :128.000 Max. :91.00 Max. :8.042
##
## num_keywords data_channel_is_lifestyle data_channel_is_entertainment
## Min. : 1.000 0:37544 0:32587
## 1st Qu.: 6.000 1: 2099 1: 7056
## Median : 7.000
## Mean : 7.224
## 3rd Qu.: 9.000
## Max. :10.000
##
## data_channel_is_bus data_channel_is_socmed data_channel_is_tech
## 0:33385 0:37320 0:32297
## 1: 6258 1: 2323 1: 7346
##
##
##
##
##
## data_channel_is_world kw_min_min kw_max_min kw_avg_min
## 0:31216 Min. : -1.00 Min. : 0 Min. : -1.0
## 1: 8427 1st Qu.: -1.00 1st Qu.: 445 1st Qu.: 141.8
## Median : -1.00 Median : 660 Median : 235.5
## Mean : 26.11 Mean : 1154 Mean : 312.4
## 3rd Qu.: 4.00 3rd Qu.: 1000 3rd Qu.: 357.0
## Max. :377.00 Max. :298400 Max. :42827.9
##
## kw_min_max kw_max_max kw_avg_max kw_min_avg
## Min. : 0 Min. : 0 Min. : 0 Min. : -1
## 1st Qu.: 0 1st Qu.:843300 1st Qu.:172844 1st Qu.: 0
## Median : 1400 Median :843300 Median :244567 Median :1024
## Mean : 13612 Mean :752322 Mean :259280 Mean :1117
## 3rd Qu.: 7900 3rd Qu.:843300 3rd Qu.:330980 3rd Qu.:2057
## Max. :843300 Max. :843300 Max. :843300 Max. :3613
##
## kw_max_avg kw_avg_avg self_reference_min_shares
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 3562 1st Qu.: 2382 1st Qu.: 639
## Median : 4356 Median : 2870 Median : 1200
## Mean : 5657 Mean : 3136 Mean : 3999
## 3rd Qu.: 6020 3rd Qu.: 3600 3rd Qu.: 2600
## Max. :298400 Max. :43568 Max. :843300
##
## self_reference_max_shares self_reference_avg_sharess weekday_is_monday
## Min. : 0 Min. : 0.0 0:32982
## 1st Qu.: 1100 1st Qu.: 981.1 1: 6661
## Median : 2800 Median : 2200.0
## Mean : 10330 Mean : 6401.7
## 3rd Qu.: 8000 3rd Qu.: 5200.0
## Max. :843300 Max. :843300.0
##
## weekday_is_tuesday weekday_is_wednesday weekday_is_thursday
## 0:32254 0:32208 0:32376
## 1: 7389 1: 7435 1: 7267
##
##
##
##
##
## weekday_is_friday weekday_is_saturday weekday_is_sunday is_weekend
## 0:33942 0:37190 0:36906 0:34453
## 1: 5701 1: 2453 1: 2737 1: 5190
##
##
##
##
##
## LDA_00 LDA_01 LDA_02 LDA_03
## Min. :0.01818 Min. :0.01818 Min. :0.01818 Min. :0.01818
## 1st Qu.:0.02505 1st Qu.:0.02501 1st Qu.:0.02857 1st Qu.:0.02857
## Median :0.03339 Median :0.03334 Median :0.04000 Median :0.04000
## Mean :0.18460 Mean :0.14126 Mean :0.21633 Mean :0.22378
## 3rd Qu.:0.24097 3rd Qu.:0.15084 3rd Qu.:0.33422 3rd Qu.:0.37578
## Max. :0.92699 Max. :0.92595 Max. :0.92000 Max. :0.92653
##
## LDA_04 global_subjectivity global_sentiment_polarity
## Min. :0.01818 Min. :0.0000 Min. :-0.39375
## 1st Qu.:0.02857 1st Qu.:0.3962 1st Qu.: 0.05776
## Median :0.04073 Median :0.4535 Median : 0.11912
## Mean :0.23404 Mean :0.4434 Mean : 0.11931
## 3rd Qu.:0.39999 3rd Qu.:0.5083 3rd Qu.: 0.17784
## Max. :0.92719 Max. :1.0000 Max. : 0.72784
##
## global_rate_positive_words global_rate_negative_words rate_positive_words
## Min. :0.00000 Min. :0.000000 Min. :0.0000
## 1st Qu.:0.02839 1st Qu.:0.009615 1st Qu.:0.6000
## Median :0.03902 Median :0.015337 Median :0.7105
## Mean :0.03963 Mean :0.016613 Mean :0.6822
## 3rd Qu.:0.05028 3rd Qu.:0.021739 3rd Qu.:0.8000
## Max. :0.15549 Max. :0.184932 Max. :1.0000
##
## rate_negative_words avg_positive_polarity min_positive_polarity
## Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.1852 1st Qu.:0.3062 1st Qu.:0.05000
## Median :0.2800 Median :0.3588 Median :0.10000
## Mean :0.2879 Mean :0.3538 Mean :0.09545
## 3rd Qu.:0.3846 3rd Qu.:0.4114 3rd Qu.:0.10000
## Max. :1.0000 Max. :1.0000 Max. :1.00000
##
## max_positive_polarity avg_negative_polarity min_negative_polarity
## Min. :0.0000 Min. :-1.0000 Min. :-1.000
## 1st Qu.:0.6000 1st Qu.:-0.3284 1st Qu.:-0.700
## Median :0.8000 Median :-0.2533 Median :-0.500
## Mean :0.7567 Mean :-0.2595 Mean :-0.522
## 3rd Qu.:1.0000 3rd Qu.:-0.1869 3rd Qu.:-0.300
## Max. :1.0000 Max. : 0.0000 Max. : 0.000
##
## max_negative_polarity title_subjectivity title_sentiment_polarity
## Min. :-1.0000 Min. :0.0000 Min. :-1.00000
## 1st Qu.:-0.1250 1st Qu.:0.0000 1st Qu.: 0.00000
## Median :-0.1000 Median :0.1500 Median : 0.00000
## Mean :-0.1075 Mean :0.2824 Mean : 0.07143
## 3rd Qu.:-0.0500 3rd Qu.:0.5000 3rd Qu.: 0.15000
## Max. : 0.0000 Max. :1.0000 Max. : 1.00000
##
## abs_title_subjectivity abs_title_sentiment_polarity shares
## Min. :0.0000 Min. :0.0000 Min. : 1
## 1st Qu.:0.1667 1st Qu.:0.0000 1st Qu.: 946
## Median :0.5000 Median :0.0000 Median : 1400
## Mean :0.3419 Mean :0.1561 Mean : 3395
## 3rd Qu.:0.5000 3rd Qu.:0.2500 3rd Qu.: 2800
## Max. :0.5000 Max. :1.0000 Max. :843300
##
## popular
## 0:20082
## 1:19561
##
##
##
##
##
Before building the model, the data were split into training and test sets in a 50:50 ratio.
# Data Set Split
n <- nrow(news)
set.seed(181215)
index.train <- sort(sample(1:n,n/2,replace=FALSE))
news.train <- news[index.train,]
news.test <- news[-index.train,]
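When the number of rows is odd, n/2 is not an integer and sample() truncates it, so the test set ends up one row larger than the training set. With the 39,643 rows remaining here, that means 19,821 training rows and 19,822 test rows. The sketch below shows the same behaviour on a made-up row count; `n_toy`, `index_train`, and `index_test` are hypothetical names.

```r
# A hypothetical odd number of rows.
n_toy <- 11

set.seed(1)
# floor() makes the truncation explicit instead of relying on sample().
index_train <- sort(sample(seq_len(n_toy), floor(n_toy / 2), replace = FALSE))
index_test  <- setdiff(seq_len(n_toy), index_train)

print(length(index_train))
print(length(index_test))
```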
This analysis uses Random Forest, a method that combines many decision trees into a single model. Random Forest mitigates the weaknesses of a single decision tree and offers strong predictive performance; because this dataset has a fairly large number of variables, Random Forest was judged to be a good fit. A Random Forest model was fitted to classify popular, using all variables except shares and url.
# Fit the Random Forest model
set.seed(181215)
news.rf1 <- randomForest(popular ~ .-shares - url , news.train)
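No tuning parameters are passed in the call, so the package defaults apply: 500 trees and, for classification, floor(sqrt(p)) candidate variables at each split, where p is the number of predictors. The short calculation below is a sketch of where the value 7 in the printed model comes from; the variable names are hypothetical, and the column count assumes the 61 original variables plus popular.

```r
# 61 original columns plus the derived `popular` column.
n_columns <- 62

# The response (popular) and the two excluded variables (shares, url).
n_excluded <- 3

# Number of predictors entering the model.
p <- n_columns - n_excluded

# Default number of variables tried at each split for classification.
mtry_default <- max(floor(sqrt(p)), 1)
print(mtry_default)
```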
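Next, the fitted model and its variable importance (how strongly each variable contributes to the model's explanatory power) are examined.
# Inspect the model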
news.rf1
##
## Call:
## randomForest(formula = popular ~ . - shares - url, data = news.train)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 33.76%
## Confusion matrix:
## 0 1 class.error
## 0 6742 3350 0.3319461
## 1 3342 6387 0.3435091
importance(news.rf1)
## MeanDecreaseGini
## timedelta 290.84410
## n_tokens_title 149.60065
## n_tokens_content 229.15671
## n_unique_tokens 253.33979
## n_non_stop_words 226.58606
## n_non_stop_unique_tokens 257.66839
## num_hrefs 196.92186
## num_self_hrefs 126.52540
## num_imgs 137.01156
## num_videos 76.12201
## average_token_length 255.23249
## num_keywords 99.28233
## data_channel_is_lifestyle 14.35835
## data_channel_is_entertainment 79.33210
## data_channel_is_bus 18.76879
## data_channel_is_socmed 39.86431
## data_channel_is_tech 46.44902
## data_channel_is_world 64.40557
## kw_min_min 46.08973
## kw_max_min 244.64755
## kw_avg_min 272.39033
## kw_min_max 164.88279
## kw_max_max 57.20365
## kw_avg_max 277.41767
## kw_min_avg 233.48924
## kw_max_avg 373.38827
## kw_avg_avg 418.31102
## self_reference_min_shares 283.21360
## self_reference_max_shares 218.30834
## self_reference_avg_sharess 260.54455
## weekday_is_monday 23.93554
## weekday_is_tuesday 25.59525
## weekday_is_wednesday 28.65229
## weekday_is_thursday 24.34554
## weekday_is_friday 23.34009
## weekday_is_saturday 35.58614
## weekday_is_sunday 24.64007
## is_weekend 88.45346
## LDA_00 274.86531
## LDA_01 280.01828
## LDA_02 328.64412
## LDA_03 255.89929
## LDA_04 279.41050
## global_subjectivity 261.68133
## global_sentiment_polarity 233.11521
## global_rate_positive_words 240.84038
## global_rate_negative_words 220.01637
## rate_positive_words 193.89371
## rate_negative_words 193.41509
## avg_positive_polarity 240.73241
## min_positive_polarity 120.64513
## max_positive_polarity 97.92928
## avg_negative_polarity 225.78484
## min_negative_polarity 137.89751
## max_negative_polarity 133.60407
## title_subjectivity 123.93128
## title_sentiment_polarity 143.99882
## abs_title_subjectivity 114.95865
## abs_title_sentiment_polarity 115.61161
# Visual check
varImpPlot(news.rf1)
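The importance table is printed in column order, which makes the strongest variables hard to pick out. The sketch below sorts a handful of the MeanDecreaseGini values copied from the output above; `gini` and `ranked` are hypothetical names, and only seven of the 59 variables are included for brevity.

```r
# A few MeanDecreaseGini values copied from the importance() output.
gini <- c(
  timedelta                 = 290.84410,
  data_channel_is_lifestyle = 14.35835,
  kw_max_avg                = 373.38827,
  kw_avg_avg                = 418.31102,
  self_reference_min_shares = 283.21360,
  weekday_is_friday         = 23.34009,
  LDA_02                    = 328.64412
)

# Sort from most to least important.
ranked <- sort(gini, decreasing = TRUE)
print(head(ranked, 3))

# On the fitted model, the full table could be ordered the same way:
# imp <- importance(news.rf1)
# imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]
```

In this output the keyword share statistics (kw_avg_avg, kw_max_avg) rank highest, while the individual weekday and channel indicators rank lowest.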
The model fitted on news.train was then applied to news.test for validation. Of the 19,822 test cases, 6,534 were misclassified (33.0%).
# Apply to the test set
rfpred <- predict(news.rf1, newdata=news.test)
confusionMatrix(rfpred, news.test$popular)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6667 3211
## 1 3323 6621
##
## Accuracy : 0.6704
## 95% CI : (0.6638, 0.6769)
## No Information Rate : 0.504
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.3408
## Mcnemar's Test P-Value : 0.1697
##
## Sensitivity : 0.6674
## Specificity : 0.6734
## Pos Pred Value : 0.6749
## Neg Pred Value : 0.6658
## Prevalence : 0.5040
## Detection Rate : 0.3363
## Detection Prevalence : 0.4983
## Balanced Accuracy : 0.6704
##
## 'Positive' Class : 0
##
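The headline figures reported by confusionMatrix() can be reproduced by hand from the four cell counts, which helps in reading the output. The following sketch rebuilds the table from the numbers printed above; `cm` and the metric names are hypothetical, and class 0 is treated as the positive class, as the output states.

```r
# Confusion matrix copied from the output: rows are predictions,
# columns are the reference (true) classes. matrix() fills by column.
cm <- matrix(
  c(6667, 3323, 3211, 6621),
  nrow = 2,
  dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1"))
)

n_test        <- sum(cm)                 # number of test cases
misclassified <- n_test - sum(diag(cm))  # off-diagonal cells
accuracy      <- sum(diag(cm)) / n_test

# With class 0 as the positive class:
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # true 0s predicted as 0
specificity <- cm["1", "1"] / sum(cm[, "1"])  # true 1s predicted as 1

print(c(n_test = n_test, misclassified = misclassified))
print(round(c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity), 4))
```

The test error rate is 1 minus the accuracy, which is close to the out-of-bag error estimate of 33.76% reported for the training data.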
The references are as follows.