This week 2 assignment for DATA 607 subsets the UCI OnlineNewsPopularity dataset, which is described here:
https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity
The dataset itself can be downloaded here:
https://archive.ics.uci.edu/ml/machine-learning-databases/00332/OnlineNewsPopularity.zip
The authors of this study collected over 39,000 articles from the Mashable website as a base set of data for predictive analytics, using a novel, proactive Intelligent Decision Support System (IDSS) that analyzes articles prior to their publication. They extracted a broad set of features from the articles, such as keywords, digital content, and other early indicators of popularity.
The OnlineNewsPopularity dataset summarizes a set of features and statistics about articles published by Mashable (www.mashable.com) over a period of two years – 2013 and 2014. The goal is to predict the number of shares in social networks as a means of assessing the popularity of the article.
The processing below requires the lubridate package. The dataset will be downloaded and unzipped within the current working directory.
Load the data from the UCI Machine Learning Repository
## load lubridate, which provides ymd() for parsing the publication date
library(lubridate)
## download the OnlineNewsPopularity dataset
## dataset is in a zip file
## assumes that the working directory has been set
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00332/OnlineNewsPopularity.zip"
## use a name that does not shadow the download.file() function
zip_file <- "OnlineNewsPopularity.zip"
if (!file.exists(zip_file)) {
  download.file(url, zip_file, mode = "wb")
  unzip(zip_file)
}
## load the dataset into a dataframe
## the dataset is in csv format with column headers
online_news <- read.csv("./OnlineNewsPopularity/OnlineNewsPopularity.csv", header=TRUE, stringsAsFactors = FALSE)
Let’s look at the OnlineNewsPopularity dataset from the UCI Machine Learning Repository
str(online_news)
## 'data.frame': 39644 obs. of 61 variables:
## $ url : chr "http://mashable.com/2013/01/07/amazon-instant-video-browser/" "http://mashable.com/2013/01/07/ap-samsung-sponsored-tweets/" "http://mashable.com/2013/01/07/apple-40-billion-app-downloads/" "http://mashable.com/2013/01/07/astronaut-notre-dame-bcs/" ...
## $ timedelta : num 731 731 731 731 731 731 731 731 731 731 ...
## $ n_tokens_title : num 12 9 9 9 13 10 8 12 11 10 ...
## $ n_tokens_content : num 219 255 211 531 1072 ...
## $ n_unique_tokens : num 0.664 0.605 0.575 0.504 0.416 ...
## $ n_non_stop_words : num 1 1 1 1 1 ...
## $ n_non_stop_unique_tokens : num 0.815 0.792 0.664 0.666 0.541 ...
## $ num_hrefs : num 4 3 3 9 19 2 21 20 2 4 ...
## $ num_self_hrefs : num 2 1 1 0 19 2 20 20 0 1 ...
## $ num_imgs : num 1 1 1 1 20 0 20 20 0 1 ...
## $ num_videos : num 0 0 0 0 0 0 0 0 0 1 ...
## $ average_token_length : num 4.68 4.91 4.39 4.4 4.68 ...
## $ num_keywords : num 5 4 6 7 7 9 10 9 7 5 ...
## $ data_channel_is_lifestyle : num 0 0 0 0 0 0 1 0 0 0 ...
## $ data_channel_is_entertainment: num 1 0 0 1 0 0 0 0 0 0 ...
## $ data_channel_is_bus : num 0 1 1 0 0 0 0 0 0 0 ...
## $ data_channel_is_socmed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ data_channel_is_tech : num 0 0 0 0 1 1 0 1 1 0 ...
## $ data_channel_is_world : num 0 0 0 0 0 0 0 0 0 1 ...
## $ kw_min_min : num 0 0 0 0 0 0 0 0 0 0 ...
## $ kw_max_min : num 0 0 0 0 0 0 0 0 0 0 ...
## $ kw_avg_min : num 0 0 0 0 0 0 0 0 0 0 ...
## $ kw_min_max : num 0 0 0 0 0 0 0 0 0 0 ...
## $ kw_max_max : num 0 0 0 0 0 0 0 0 0 0 ...
## $ kw_avg_max : num 0 0 0 0 0 0 0 0 0 0 ...
## $ kw_min_avg : num 0 0 0 0 0 0 0 0 0 0 ...
## $ kw_max_avg : num 0 0 0 0 0 0 0 0 0 0 ...
## $ kw_avg_avg : num 0 0 0 0 0 0 0 0 0 0 ...
## $ self_reference_min_shares : num 496 0 918 0 545 8500 545 545 0 0 ...
## $ self_reference_max_shares : num 496 0 918 0 16000 8500 16000 16000 0 0 ...
## $ self_reference_avg_sharess : num 496 0 918 0 3151 ...
## $ weekday_is_monday : num 1 1 1 1 1 1 1 1 1 1 ...
## $ weekday_is_tuesday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday_is_wednesday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday_is_thursday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday_is_friday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday_is_saturday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday_is_sunday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ is_weekend : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LDA_00 : num 0.5003 0.7998 0.2178 0.0286 0.0286 ...
## $ LDA_01 : num 0.3783 0.05 0.0333 0.4193 0.0288 ...
## $ LDA_02 : num 0.04 0.0501 0.0334 0.4947 0.0286 ...
## $ LDA_03 : num 0.0413 0.0501 0.0333 0.0289 0.0286 ...
## $ LDA_04 : num 0.0401 0.05 0.6822 0.0286 0.8854 ...
## $ global_subjectivity : num 0.522 0.341 0.702 0.43 0.514 ...
## $ global_sentiment_polarity : num 0.0926 0.1489 0.3233 0.1007 0.281 ...
## $ global_rate_positive_words : num 0.0457 0.0431 0.0569 0.0414 0.0746 ...
## $ global_rate_negative_words : num 0.0137 0.01569 0.00948 0.02072 0.01213 ...
## $ rate_positive_words : num 0.769 0.733 0.857 0.667 0.86 ...
## $ rate_negative_words : num 0.231 0.267 0.143 0.333 0.14 ...
## $ avg_positive_polarity : num 0.379 0.287 0.496 0.386 0.411 ...
## $ min_positive_polarity : num 0.1 0.0333 0.1 0.1364 0.0333 ...
## $ max_positive_polarity : num 0.7 0.7 1 0.8 1 0.6 1 1 0.8 0.5 ...
## $ avg_negative_polarity : num -0.35 -0.119 -0.467 -0.37 -0.22 ...
## $ min_negative_polarity : num -0.6 -0.125 -0.8 -0.6 -0.5 -0.4 -0.5 -0.5 -0.125 -0.5 ...
## $ max_negative_polarity : num -0.2 -0.1 -0.133 -0.167 -0.05 ...
## $ title_subjectivity : num 0.5 0 0 0 0.455 ...
## $ title_sentiment_polarity : num -0.188 0 0 0 0.136 ...
## $ abs_title_subjectivity : num 0 0.5 0.5 0.5 0.0455 ...
## $ abs_title_sentiment_polarity : num 0.188 0 0 0 0.136 ...
## $ shares : int 593 711 1500 1200 505 855 556 891 3600 710 ...
The following lists the predictor or explanatory variables being considered, as provided by the creators of the dataset:
Variable Name | Description |
---|---|
url | URL of the article |
timedelta | Days between the article publication and the dataset acquisition |
n_tokens_title | Number of words in the title |
n_tokens_content | Number of words in the content |
n_unique_tokens | Rate of unique words in the content |
n_non_stop_words | Rate of non-stop words in the content |
n_non_stop_unique_tokens | Rate of unique non-stop words in the content |
num_hrefs | Number of links |
num_self_hrefs | Number of links to other articles published by Mashable |
num_imgs | Number of images |
num_videos | Number of videos |
average_token_length | Average length of the words in the content |
num_keywords | Number of keywords in the metadata |
data_channel_is_lifestyle | Is data channel ‘Lifestyle’? |
data_channel_is_entertainment | Is data channel ‘Entertainment’? |
data_channel_is_bus | Is data channel ‘Business’? |
data_channel_is_socmed | Is data channel ‘Social Media’? |
data_channel_is_tech | Is data channel ‘Tech’? |
data_channel_is_world | Is data channel ‘World’? |
kw_min_min | Worst keyword (min. shares) |
kw_max_min | Worst keyword (max. shares) |
kw_avg_min | Worst keyword (avg. shares) |
kw_min_max | Best keyword (min. shares) |
kw_max_max | Best keyword (max. shares) |
kw_avg_max | Best keyword (avg. shares) |
kw_min_avg | Avg. keyword (min. shares) |
kw_max_avg | Avg. keyword (max. shares) |
kw_avg_avg | Avg. keyword (avg. shares) |
self_reference_min_shares | Min. shares of referenced articles in Mashable |
self_reference_max_shares | Max. shares of referenced articles in Mashable |
self_reference_avg_sharess | Avg. shares of referenced articles in Mashable |
weekday_is_monday | Was the article published on a Monday? |
weekday_is_tuesday | Was the article published on a Tuesday? |
weekday_is_wednesday | Was the article published on a Wednesday? |
weekday_is_thursday | Was the article published on a Thursday? |
weekday_is_friday | Was the article published on a Friday? |
weekday_is_saturday | Was the article published on a Saturday? |
weekday_is_sunday | Was the article published on a Sunday? |
is_weekend | Was the article published on the weekend? |
LDA_00 | Closeness to LDA topic 0 |
LDA_01 | Closeness to LDA topic 1 |
LDA_02 | Closeness to LDA topic 2 |
LDA_03 | Closeness to LDA topic 3 |
LDA_04 | Closeness to LDA topic 4 |
global_subjectivity | Text subjectivity |
global_sentiment_polarity | Text sentiment polarity |
global_rate_positive_words | Rate of positive words in the content |
global_rate_negative_words | Rate of negative words in the content |
rate_positive_words | Rate of positive words among non-neutral tokens |
rate_negative_words | Rate of negative words among non-neutral tokens |
avg_positive_polarity | Avg. polarity of positive words |
min_positive_polarity | Min. polarity of positive words |
max_positive_polarity | Max. polarity of positive words |
avg_negative_polarity | Avg. polarity of negative words |
min_negative_polarity | Min. polarity of negative words |
max_negative_polarity | Max. polarity of negative words |
title_subjectivity | Title subjectivity |
title_sentiment_polarity | Title polarity |
abs_title_subjectivity | Absolute subjectivity level |
abs_title_sentiment_polarity | Absolute polarity level |
shares | Number of shares (target) |
Transformations applied to the dataset
1.) A new categorical variable called news_channel will be created, valued with Lifestyle, Entertainment, Business, Social Media, Technology, and World. These values will be derived from the data channel indicator variables in the dataset (data_channel_is_lifestyle, data_channel_is_entertainment, data_channel_is_bus, data_channel_is_socmed, data_channel_is_tech, and data_channel_is_world).
2.) A new categorical variable called day_published will be created to indicate the day of the week the news article was published. The day of the week will be derived from the weekday indicator variables in the dataset (weekday_is_monday through weekday_is_sunday).
3.) The date of publication and the publication year will be derived as separate variables from the URL.
4.) Variables not being considered will be removed from the final data frame.
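Transformations 1 and 2 both collapse a set of 0/1 indicator columns into a single factor. A minimal sketch of that idea on made-up toy data follows; the helper name collapse_indicators and the toy rows are assumptions for illustration and are not part of the assignment code. A row with no indicator set stays NA.

```r
# Toy data: three made-up rows using two of the dataset's indicator columns
toy <- data.frame(
  data_channel_is_bus  = c(1, 0, 0),
  data_channel_is_tech = c(0, 1, 0)  # the third row belongs to neither channel
)

# Map each indicator column name to the label it should produce
channel_labels <- c(data_channel_is_bus  = "Business",
                    data_channel_is_tech = "Technology")

# Hypothetical helper: collapse 0/1 indicator columns into one factor
collapse_indicators <- function(df, labels) {
  out <- rep(NA_character_, nrow(df))
  for (col in names(labels)) {
    out[df[[col]] == 1] <- labels[[col]]
  }
  factor(out, levels = unname(labels))
}

toy$news_channel <- collapse_indicators(toy, channel_labels)
```

The same helper could be reused for the weekday indicators by passing a different named vector of labels.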
## keep only the specific variables needed for this analysis
keepvars <- c("url",
"n_tokens_title",
"n_tokens_content",
"num_imgs",
"num_videos",
"data_channel_is_lifestyle",
"data_channel_is_entertainment",
"data_channel_is_bus",
"data_channel_is_socmed",
"data_channel_is_tech",
"data_channel_is_world",
"weekday_is_monday",
"weekday_is_tuesday",
"weekday_is_wednesday",
"weekday_is_thursday",
"weekday_is_friday",
"weekday_is_saturday",
"weekday_is_sunday",
"shares")
# remove the variables not considered in the analysis from the data frame
online_news_df <- online_news[keepvars]
# convert the dummy variables to categorical variables for the data channel
online_news_df$news_channel <- NA
online_news_df$news_channel[online_news_df$data_channel_is_lifestyle==1] <- "Lifestyle"
online_news_df$news_channel[online_news_df$data_channel_is_entertainment==1] <- "Entertainment"
online_news_df$news_channel[online_news_df$data_channel_is_bus==1] <- "Business"
online_news_df$news_channel[online_news_df$data_channel_is_socmed==1] <- "Social Media"
online_news_df$news_channel[online_news_df$data_channel_is_tech==1] <- "Technology"
online_news_df$news_channel[online_news_df$data_channel_is_world==1] <- "World"
# Create the News Channel variable
online_news_df$news_channel <- factor(online_news_df$news_channel,
levels = c("Business",
"Entertainment",
"Lifestyle",
"Technology",
"World",
"Social Media"))
# convert the dummy variables to categorical variables for the day of the week the article was published
online_news_df$day_published <- NA
online_news_df$day_published[online_news_df$weekday_is_monday==1] <- "Monday"
online_news_df$day_published[online_news_df$weekday_is_tuesday==1] <- "Tuesday"
online_news_df$day_published[online_news_df$weekday_is_wednesday==1] <- "Wednesday"
online_news_df$day_published[online_news_df$weekday_is_thursday==1] <- "Thursday"
online_news_df$day_published[online_news_df$weekday_is_friday==1] <- "Friday"
online_news_df$day_published[online_news_df$weekday_is_saturday==1] <- "Saturday"
online_news_df$day_published[online_news_df$weekday_is_sunday==1] <- "Sunday"
# create the variable for the day, date, and year of publication
online_news_df$day_published <- factor(online_news_df$day_published,
levels = c( "Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday"))
online_news_df$date_published <- ymd(substr(online_news_df$url, 21, 30))
online_news_df$year_published <- as.numeric(substr(online_news_df$url, 21, 24))
## drop unused variables
removevars <- c("data_channel_is_lifestyle",
"data_channel_is_entertainment",
"data_channel_is_bus",
"data_channel_is_socmed",
"data_channel_is_tech",
"data_channel_is_world",
"weekday_is_monday",
"weekday_is_tuesday",
"weekday_is_wednesday",
"weekday_is_thursday",
"weekday_is_friday",
"weekday_is_saturday",
"weekday_is_sunday")
online_news_df <- online_news_df[, !(colnames(online_news_df) %in% removevars)]
## keep only complete cases within the dataset. Some day and channel indicators are not valued in the original dataset.
online_news_df <- online_news_df[complete.cases(online_news_df), ]
# remove the original online news data frame from the workspace
rm(online_news)
# Look at final Online News Popularity data frame
str(online_news_df)
## 'data.frame': 33510 obs. of 10 variables:
## $ url : chr "http://mashable.com/2013/01/07/amazon-instant-video-browser/" "http://mashable.com/2013/01/07/ap-samsung-sponsored-tweets/" "http://mashable.com/2013/01/07/apple-40-billion-app-downloads/" "http://mashable.com/2013/01/07/astronaut-notre-dame-bcs/" ...
## $ n_tokens_title : num 12 9 9 9 13 10 8 12 11 10 ...
## $ n_tokens_content: num 219 255 211 531 1072 ...
## $ num_imgs : num 1 1 1 1 20 0 20 20 0 1 ...
## $ num_videos : num 0 0 0 0 0 0 0 0 0 1 ...
## $ shares : int 593 711 1500 1200 505 855 556 891 3600 710 ...
## $ news_channel : Factor w/ 6 levels "Business","Entertainment",..: 2 1 1 2 4 4 3 4 4 5 ...
## $ day_published : Factor w/ 7 levels "Monday","Tuesday",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ date_published : POSIXct, format: "2013-01-07" "2013-01-07" ...
## $ year_published : num 2013 2013 2013 2013 2013 ...
summary(online_news_df)
## url n_tokens_title n_tokens_content num_imgs
## Length:33510 Min. : 2.00 Min. : 0.0 Min. : 0.000
## Class :character 1st Qu.: 9.00 1st Qu.: 272.0 1st Qu.: 1.000
## Mode :character Median :10.00 Median : 447.0 Median : 1.000
## Mean :10.42 Mean : 585.4 Mean : 3.959
## 3rd Qu.:12.00 3rd Qu.: 761.0 3rd Qu.: 3.000
## Max. :23.00 Max. :8474.0 Max. :128.000
##
## num_videos shares news_channel day_published
## Min. : 0.0000 Min. : 1 Business :6258 Monday :5761
## 1st Qu.: 0.0000 1st Qu.: 930 Entertainment:7057 Tuesday :6279
## Median : 0.0000 Median : 1400 Lifestyle :2099 Wednesday:6352
## Mean : 0.9984 Mean : 2929 Technology :7346 Thursday :6165
## 3rd Qu.: 1.0000 3rd Qu.: 2500 World :8427 Friday :4735
## Max. :75.0000 Max. :690400 Social Media :2323 Saturday :2029
## Sunday :2189
## date_published year_published
## Min. :2013-01-07 00:00:00 Min. :2013
## 1st Qu.:2013-07-16 00:00:00 1st Qu.:2013
## Median :2014-02-06 00:00:00 Median :2014
## Mean :2014-01-19 13:35:05 Mean :2014
## 3rd Qu.:2014-07-27 00:00:00 3rd Qu.:2014
## Max. :2014-12-27 00:00:00 Max. :2014
##
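The two str() outputs show 39,644 rows in the original data and 33,510 rows in the final data frame, so the complete.cases() filter removed 6,134 articles. The sketch below, on made-up toy data (the values are assumptions, not rows from the dataset), shows one way to check which columns contain missing values before rows are dropped.

```r
# Toy data frame with one missing news_channel value
toy <- data.frame(
  shares        = c(593, 711, 1500, 1200),
  news_channel  = c("Business", NA, "World", "Technology"),
  day_published = c("Monday", "Monday", "Tuesday", "Friday")
)

# Count missing values per column before dropping anything
missing_per_column <- colSums(is.na(toy))

# complete.cases() is TRUE only for rows with no NA in any column
toy_complete <- toy[complete.cases(toy), ]
rows_dropped <- nrow(toy) - nrow(toy_complete)
```

Running the same colSums(is.na(...)) check on online_news_df just before the complete.cases() step would show whether the dropped rows come from news_channel, day_published, or both.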
The following lists the variables derived or retained in the final data frame.
Variable Name | Description |
---|---|
url | URL of the article |
n_tokens_title | Number of words in the title |
n_tokens_content | Number of words in the content |
num_imgs | Number of images |
num_videos | Number of videos |
shares | Number of shares (target) |
news_channel | Factor: Business, Entertainment, Lifestyle, Technology, World, Social Media |
day_published | Factor: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday |
year_published | Year of article’s publication: 2013 or 2014 |
date_published | Date of article’s publication |
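The date_published and year_published variables are taken from fixed character positions in the URL, which works because every URL starts with http://mashable.com/ followed by yyyy/mm/dd. As a hedged alternative, the sketch below uses base R only (sub() and as.Date() rather than lubridate's ymd()) and finds the date by pattern instead of by position. The second URL is made up for illustration.

```r
# Toy URLs in the same shape as the dataset's url column
urls <- c("http://mashable.com/2013/01/07/amazon-instant-video-browser/",
          "http://mashable.com/2014/12/27/example-article/")  # made-up URL

# Pull out the yyyy/mm/dd path segment by pattern and rewrite it as yyyy-mm-dd
date_text <- sub("^.*/([0-9]{4})/([0-9]{2})/([0-9]{2})/.*$", "\\1-\\2-\\3", urls)

# Parse the text as a Date, then take the year from the parsed date
date_published <- as.Date(date_text, format = "%Y-%m-%d")
year_published <- as.numeric(format(date_published, "%Y"))
```

Note that this yields a Date column rather than the POSIXct column shown in the str() output above.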
If we want to create data frames by year, we can subset the data using the year_published variable:
online_news_df.2013 <- subset(online_news_df, year_published == 2013)
online_news_df.2014 <- subset(online_news_df, year_published == 2014)
# show summary statistics for 2013
summary(online_news_df.2013)
## url n_tokens_title n_tokens_content num_imgs
## Length:15192 Min. : 2.000 Min. : 0.0 Min. : 0.000
## Class :character 1st Qu.: 9.000 1st Qu.: 247.0 1st Qu.: 1.000
## Mode :character Median :10.000 Median : 408.5 Median : 1.000
## Mean : 9.895 Mean : 542.8 Mean : 3.531
## 3rd Qu.:11.000 3rd Qu.: 696.0 3rd Qu.: 1.000
## Max. :18.000 Max. :6505.0 Max. :101.000
##
## num_videos shares news_channel day_published
## Min. : 0.0000 Min. : 1 Business :3194 Monday :2586
## 1st Qu.: 0.0000 1st Qu.: 914 Entertainment:2862 Tuesday :2860
## Median : 0.0000 Median : 1400 Lifestyle :1191 Wednesday:2952
## Mean : 0.8237 Mean : 3036 Technology :3942 Thursday :2801
## 3rd Qu.: 1.0000 3rd Qu.: 2700 World :2634 Friday :2135
## Max. :75.0000 Max. :690400 Social Media :1369 Saturday : 908
## Sunday : 950
## date_published year_published
## Min. :2013-01-07 00:00:00 Min. :2013
## 1st Qu.:2013-03-29 00:00:00 1st Qu.:2013
## Median :2013-06-26 00:00:00 Median :2013
## Mean :2013-06-28 09:05:46 Mean :2013
## 3rd Qu.:2013-09-25 00:00:00 3rd Qu.:2013
## Max. :2013-12-31 00:00:00 Max. :2013
##
# show summary statistics for 2014
summary(online_news_df.2014)
## url n_tokens_title n_tokens_content num_imgs
## Length:18318 Min. : 3.00 Min. : 0.0 Min. : 0.000
## Class :character 1st Qu.: 9.00 1st Qu.: 294.0 1st Qu.: 1.000
## Mode :character Median :11.00 Median : 482.0 Median : 1.000
## Mean :10.85 Mean : 620.8 Mean : 4.315
## 3rd Qu.:12.00 3rd Qu.: 814.0 3rd Qu.: 3.000
## Max. :23.00 Max. :8474.0 Max. :128.000
##
## num_videos shares news_channel day_published
## Min. : 0.000 Min. : 5.0 Business :3064 Monday :3175
## 1st Qu.: 0.000 1st Qu.: 940.2 Entertainment:4195 Tuesday :3419
## Median : 0.000 Median : 1400.0 Lifestyle : 908 Wednesday:3400
## Mean : 1.143 Mean : 2839.5 Technology :3404 Thursday :3364
## 3rd Qu.: 1.000 3rd Qu.: 2400.0 World :5793 Friday :2600
## Max. :75.000 Max. :663600.0 Social Media : 954 Saturday :1121
## Sunday :1239
## date_published year_published
## Min. :2014-01-01 00:00:00 Min. :2014
## 1st Qu.:2014-04-15 00:00:00 1st Qu.:2014
## Median :2014-07-14 00:00:00 Median :2014
## Mean :2014-07-08 17:42:02 Mean :2014
## 3rd Qu.:2014-10-06 00:00:00 3rd Qu.:2014
## Max. :2014-12-27 00:00:00 Max. :2014
##
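As an alternative to one subset() call per year, split() can produce a list of data frames keyed by year in a single step. A minimal sketch on made-up toy data (the values are assumptions, not rows from the dataset):

```r
# Toy data frame with two articles per year
toy <- data.frame(
  shares         = c(593, 711, 1500, 1200),
  year_published = c(2013, 2013, 2014, 2014)
)

# split() returns a named list with one data frame per year
by_year <- split(toy, toy$year_published)

# Apply the same statistic to each year's data frame
median_shares <- sapply(by_year, function(d) median(d$shares))
```

Each element can then be summarized on its own, for example summary(by_year[["2013"]]).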