Introduction

This Week 2 assignment for DATA 607 subsets the UCI OnlineNewsPopularity dataset, which is described here:

https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity

The dataset itself can be downloaded here:

https://archive.ics.uci.edu/ml/machine-learning-databases/00332/OnlineNewsPopularity.zip

About the Data

The authors of this study collected over 39,000 articles from the Mashable website as a base dataset for predictive analytics, using a novel and proactive Intelligent Decision Support System (IDSS) that analyzes articles prior to their publication. A broad set of features was extracted from the articles, such as keywords, digital content, and other early indicators of popularity.

The OnlineNewsPopularity dataset summarizes a set of features and statistics about articles published by Mashable (www.mashable.com) over a period of two years – 2013 and 2014. The goal is to predict the number of shares in social networks as a means of assessing the popularity of the article.

Loading and Preparing the UCI OnlineNewsPopularity Dataset

The processing below specifically requires the lubridate package. The dataset will be downloaded and unzipped within the current working directory.

Load the data from the UCI Machine Learning Repository

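## load lubridate, which provides ymd() used below to parse publication dates
library(lubridate)

## download the OnlineNewsPopularity dataset
## the dataset is in a zip file
## assumes that the working directory has been set

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00332/OnlineNewsPopularity.zip"

## local name of the downloaded archive
## (named zip_file to avoid confusion with the download.file() function)
zip_file <- "OnlineNewsPopularity.zip"

## download and unzip only if the archive is not already present
if (!file.exists(zip_file)) {
    download.file(url, zip_file, mode = "wb")
    unzip(zip_file)
}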

## load the dataset into a dataframe
## the dataset is in csv format with column headers
online_news <- read.csv("./OnlineNewsPopularity/OnlineNewsPopularity.csv", header=TRUE, stringsAsFactors = FALSE)

Let’s look at the OnlineNewsPopularity dataset from the UCI Machine Learning Repository

str(online_news)
## 'data.frame':    39644 obs. of  61 variables:
##  $ url                          : chr  "http://mashable.com/2013/01/07/amazon-instant-video-browser/" "http://mashable.com/2013/01/07/ap-samsung-sponsored-tweets/" "http://mashable.com/2013/01/07/apple-40-billion-app-downloads/" "http://mashable.com/2013/01/07/astronaut-notre-dame-bcs/" ...
##  $ timedelta                    : num  731 731 731 731 731 731 731 731 731 731 ...
##  $ n_tokens_title               : num  12 9 9 9 13 10 8 12 11 10 ...
##  $ n_tokens_content             : num  219 255 211 531 1072 ...
##  $ n_unique_tokens              : num  0.664 0.605 0.575 0.504 0.416 ...
##  $ n_non_stop_words             : num  1 1 1 1 1 ...
##  $ n_non_stop_unique_tokens     : num  0.815 0.792 0.664 0.666 0.541 ...
##  $ num_hrefs                    : num  4 3 3 9 19 2 21 20 2 4 ...
##  $ num_self_hrefs               : num  2 1 1 0 19 2 20 20 0 1 ...
##  $ num_imgs                     : num  1 1 1 1 20 0 20 20 0 1 ...
##  $ num_videos                   : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ average_token_length         : num  4.68 4.91 4.39 4.4 4.68 ...
##  $ num_keywords                 : num  5 4 6 7 7 9 10 9 7 5 ...
##  $ data_channel_is_lifestyle    : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ data_channel_is_entertainment: num  1 0 0 1 0 0 0 0 0 0 ...
##  $ data_channel_is_bus          : num  0 1 1 0 0 0 0 0 0 0 ...
##  $ data_channel_is_socmed       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ data_channel_is_tech         : num  0 0 0 0 1 1 0 1 1 0 ...
##  $ data_channel_is_world        : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ kw_min_min                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_max_min                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_avg_min                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_min_max                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_max_max                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_avg_max                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_min_avg                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_max_avg                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_avg_avg                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ self_reference_min_shares    : num  496 0 918 0 545 8500 545 545 0 0 ...
##  $ self_reference_max_shares    : num  496 0 918 0 16000 8500 16000 16000 0 0 ...
##  $ self_reference_avg_sharess   : num  496 0 918 0 3151 ...
##  $ weekday_is_monday            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ weekday_is_tuesday           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_wednesday         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_thursday          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_friday            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_saturday          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_sunday            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ is_weekend                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LDA_00                       : num  0.5003 0.7998 0.2178 0.0286 0.0286 ...
##  $ LDA_01                       : num  0.3783 0.05 0.0333 0.4193 0.0288 ...
##  $ LDA_02                       : num  0.04 0.0501 0.0334 0.4947 0.0286 ...
##  $ LDA_03                       : num  0.0413 0.0501 0.0333 0.0289 0.0286 ...
##  $ LDA_04                       : num  0.0401 0.05 0.6822 0.0286 0.8854 ...
##  $ global_subjectivity          : num  0.522 0.341 0.702 0.43 0.514 ...
##  $ global_sentiment_polarity    : num  0.0926 0.1489 0.3233 0.1007 0.281 ...
##  $ global_rate_positive_words   : num  0.0457 0.0431 0.0569 0.0414 0.0746 ...
##  $ global_rate_negative_words   : num  0.0137 0.01569 0.00948 0.02072 0.01213 ...
##  $ rate_positive_words          : num  0.769 0.733 0.857 0.667 0.86 ...
##  $ rate_negative_words          : num  0.231 0.267 0.143 0.333 0.14 ...
##  $ avg_positive_polarity        : num  0.379 0.287 0.496 0.386 0.411 ...
##  $ min_positive_polarity        : num  0.1 0.0333 0.1 0.1364 0.0333 ...
##  $ max_positive_polarity        : num  0.7 0.7 1 0.8 1 0.6 1 1 0.8 0.5 ...
##  $ avg_negative_polarity        : num  -0.35 -0.119 -0.467 -0.37 -0.22 ...
##  $ min_negative_polarity        : num  -0.6 -0.125 -0.8 -0.6 -0.5 -0.4 -0.5 -0.5 -0.125 -0.5 ...
##  $ max_negative_polarity        : num  -0.2 -0.1 -0.133 -0.167 -0.05 ...
##  $ title_subjectivity           : num  0.5 0 0 0 0.455 ...
##  $ title_sentiment_polarity     : num  -0.188 0 0 0 0.136 ...
##  $ abs_title_subjectivity       : num  0 0.5 0.5 0.5 0.0455 ...
##  $ abs_title_sentiment_polarity : num  0.188 0 0 0 0.136 ...
##  $ shares                       : int  593 711 1500 1200 505 855 556 891 3600 710 ...

Data Dictionary

The following lists the variables in the dataset, including the predictor (explanatory) variables and the target, as described by the creators of the dataset:

Variable Name                  Description
url                            URL of the article
timedelta                      Days between the article publication and the dataset acquisition
n_tokens_title                 Number of words in the title
n_tokens_content               Number of words in the content
n_unique_tokens                Rate of unique words in the content
n_non_stop_words               Rate of non-stop words in the content
n_non_stop_unique_tokens       Rate of unique non-stop words in the content
num_hrefs                      Number of links
num_self_hrefs                 Number of links to other articles published by Mashable
num_imgs                       Number of images
num_videos                     Number of videos
average_token_length           Average length of the words in the content
num_keywords                   Number of keywords in the metadata
data_channel_is_lifestyle      Is data channel ‘Lifestyle’?
data_channel_is_entertainment  Is data channel ‘Entertainment’?
data_channel_is_bus            Is data channel ‘Business’?
data_channel_is_socmed         Is data channel ‘Social Media’?
data_channel_is_tech           Is data channel ‘Tech’?
data_channel_is_world          Is data channel ‘World’?
kw_min_min                     Worst keyword (min. shares)
kw_max_min                     Worst keyword (max. shares)
kw_avg_min                     Worst keyword (avg. shares)
kw_min_max                     Best keyword (min. shares)
kw_max_max                     Best keyword (max. shares)
kw_avg_max                     Best keyword (avg. shares)
kw_min_avg                     Avg. keyword (min. shares)
kw_max_avg                     Avg. keyword (max. shares)
kw_avg_avg                     Avg. keyword (avg. shares)
self_reference_min_shares      Min. shares of referenced articles in Mashable
self_reference_max_shares      Max. shares of referenced articles in Mashable
self_reference_avg_sharess     Avg. shares of referenced articles in Mashable
weekday_is_monday              Was the article published on a Monday?
weekday_is_tuesday             Was the article published on a Tuesday?
weekday_is_wednesday           Was the article published on a Wednesday?
weekday_is_thursday            Was the article published on a Thursday?
weekday_is_friday              Was the article published on a Friday?
weekday_is_saturday            Was the article published on a Saturday?
weekday_is_sunday              Was the article published on a Sunday?
is_weekend                     Was the article published on the weekend?
LDA_00                         Closeness to LDA topic 0
LDA_01                         Closeness to LDA topic 1
LDA_02                         Closeness to LDA topic 2
LDA_03                         Closeness to LDA topic 3
LDA_04                         Closeness to LDA topic 4
global_subjectivity            Text subjectivity
global_sentiment_polarity      Text sentiment polarity
global_rate_positive_words     Rate of positive words in the content
global_rate_negative_words     Rate of negative words in the content
rate_positive_words            Rate of positive words among non-neutral tokens
rate_negative_words            Rate of negative words among non-neutral tokens
avg_positive_polarity          Avg. polarity of positive words
min_positive_polarity          Min. polarity of positive words
max_positive_polarity          Max. polarity of positive words
avg_negative_polarity          Avg. polarity of negative words
min_negative_polarity          Min. polarity of negative words
max_negative_polarity          Max. polarity of negative words
title_subjectivity             Title subjectivity
title_sentiment_polarity       Title polarity
abs_title_subjectivity         Absolute subjectivity level
abs_title_sentiment_polarity   Absolute polarity level
shares                         Number of shares (target)

Transformations applied to the dataset

1.) A new categorical variable called news_channel will be created with the values Lifestyle, Entertainment, Business, Social Media, Technology, and World. These values will be derived from the six data channel indicator variables in the dataset: data_channel_is_lifestyle, data_channel_is_entertainment, data_channel_is_bus, data_channel_is_socmed, data_channel_is_tech, and data_channel_is_world.

2.) A new categorical variable called day_published will be created to indicate the day of the week the news article was published. The day of the week will be derived from the seven weekday indicator variables in the dataset, weekday_is_monday through weekday_is_sunday.

3.) The date of publication and the publication year will be derived as separate variables from the URL

4.) Variables not being considered will be removed from the final dataframe
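The first two transformations collapse a set of 0/1 indicator columns into one categorical variable. As a minimal sketch of that idea, assuming a small made-up data frame (toy) and a hypothetical lookup vector (channel_labels), neither of which is part of the dataset:

```r
# Toy data frame with three indicator columns; the values are invented
toy <- data.frame(
  data_channel_is_bus   = c(1, 0, 0, 0),
  data_channel_is_tech  = c(0, 1, 0, 0),
  data_channel_is_world = c(0, 0, 1, 0)
)

# Hypothetical lookup: indicator column name -> label used in the factor
channel_labels <- c(data_channel_is_bus   = "Business",
                    data_channel_is_tech  = "Technology",
                    data_channel_is_world = "World")

# Start with NA so rows where no indicator is set stay missing
channel <- rep(NA_character_, nrow(toy))
for (col in names(channel_labels)) {
  channel[toy[[col]] == 1] <- channel_labels[[col]]
}

# Convert to a factor with explicit levels
toy$news_channel <- factor(channel, levels = unname(channel_labels))
```

The fourth row has no indicator set, so its news_channel remains NA. The same pattern would apply to the weekday indicators.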

Process the dataset into the final data frame

## keep only the specific variables needed for this analysis
keepvars <- c("url", 
              "n_tokens_title", 
              "n_tokens_content",
              "num_imgs",
              "num_videos",
              "data_channel_is_lifestyle",
              "data_channel_is_entertainment",
              "data_channel_is_bus",
              "data_channel_is_socmed",
              "data_channel_is_tech",    
              "data_channel_is_world",
              "weekday_is_monday",   
              "weekday_is_tuesday",  
              "weekday_is_wednesday",    
              "weekday_is_thursday",     
              "weekday_is_friday",   
              "weekday_is_saturday",     
              "weekday_is_sunday",
              "shares")

# remove the variables not considered in the analysis from the data frame

online_news_df <- online_news[keepvars]


# convert the dummy variables to categorical variables for the data channel

online_news_df$news_channel <- NA 
online_news_df$news_channel[online_news_df$data_channel_is_lifestyle==1] <- "Lifestyle"
online_news_df$news_channel[online_news_df$data_channel_is_entertainment==1] <- "Entertainment"
online_news_df$news_channel[online_news_df$data_channel_is_bus==1] <- "Business"
online_news_df$news_channel[online_news_df$data_channel_is_socmed==1] <- "Social Media"
online_news_df$news_channel[online_news_df$data_channel_is_tech==1] <- "Technology"
online_news_df$news_channel[online_news_df$data_channel_is_world==1] <- "World"

# convert the News Channel variable to a factor with explicit levels
online_news_df$news_channel <-  factor(online_news_df$news_channel, 
                                       levels = c("Business", 
                                                  "Entertainment", 
                                                  "Lifestyle", 
                                                  "Technology", 
                                                  "World",
                                                  "Social Media"))

# convert the dummy variables to categorical variables for the day of the week the article was published

online_news_df$day_published <- NA
online_news_df$day_published [online_news_df$weekday_is_monday==1] <- "Monday"
online_news_df$day_published [online_news_df$weekday_is_tuesday==1] <- "Tuesday"
online_news_df$day_published [online_news_df$weekday_is_wednesday==1] <- "Wednesday"
online_news_df$day_published [online_news_df$weekday_is_thursday==1] <- "Thursday"
online_news_df$day_published [online_news_df$weekday_is_friday==1] <- "Friday"
online_news_df$day_published [online_news_df$weekday_is_saturday==1] <- "Saturday"
online_news_df$day_published [online_news_df$weekday_is_sunday==1] <- "Sunday"

# create the variable for the day, date, and year of publication

online_news_df$day_published <- factor(online_news_df$day_published, 
                                       levels = c( "Monday", "Tuesday", "Wednesday", "Thursday",
                                                   "Friday", "Saturday", "Sunday"))

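# characters 21-30 of each URL hold the publication date as yyyy/mm/dd,
# e.g. "2013/01/07" in "http://mashable.com/2013/01/07/..."
online_news_df$date_published <- ymd(substr(online_news_df$url, 21, 30))
online_news_df$year_published <- as.numeric(substr(online_news_df$url, 21, 24))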


## drop unused variables

removevars <- c("data_channel_is_lifestyle",
                "data_channel_is_entertainment",
                "data_channel_is_bus",
                "data_channel_is_socmed",
                "data_channel_is_tech",  
                "data_channel_is_world",
                "weekday_is_monday",     
                "weekday_is_tuesday",    
                "weekday_is_wednesday",  
                "weekday_is_thursday",   
                "weekday_is_friday",     
                "weekday_is_saturday",   
                "weekday_is_sunday")

online_news_df <- online_news_df[, !(colnames(online_news_df) %in% removevars)]

## keep only complete cases within the dataset.  Some day and channel indicators are not valued in the original dataset.

online_news_df <- online_news_df[complete.cases(online_news_df), ]

#remove the original online news dataframe
rm(online_news)

# Look at final Online News Popularity data frame
str(online_news_df)
## 'data.frame':    33510 obs. of  10 variables:
##  $ url             : chr  "http://mashable.com/2013/01/07/amazon-instant-video-browser/" "http://mashable.com/2013/01/07/ap-samsung-sponsored-tweets/" "http://mashable.com/2013/01/07/apple-40-billion-app-downloads/" "http://mashable.com/2013/01/07/astronaut-notre-dame-bcs/" ...
##  $ n_tokens_title  : num  12 9 9 9 13 10 8 12 11 10 ...
##  $ n_tokens_content: num  219 255 211 531 1072 ...
##  $ num_imgs        : num  1 1 1 1 20 0 20 20 0 1 ...
##  $ num_videos      : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ shares          : int  593 711 1500 1200 505 855 556 891 3600 710 ...
##  $ news_channel    : Factor w/ 6 levels "Business","Entertainment",..: 2 1 1 2 4 4 3 4 4 5 ...
##  $ day_published   : Factor w/ 7 levels "Monday","Tuesday",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ date_published  : POSIXct, format: "2013-01-07" "2013-01-07" ...
##  $ year_published  : num  2013 2013 2013 2013 2013 ...
summary(online_news_df)
##      url            n_tokens_title  n_tokens_content    num_imgs      
##  Length:33510       Min.   : 2.00   Min.   :   0.0   Min.   :  0.000  
##  Class :character   1st Qu.: 9.00   1st Qu.: 272.0   1st Qu.:  1.000  
##  Mode  :character   Median :10.00   Median : 447.0   Median :  1.000  
##                     Mean   :10.42   Mean   : 585.4   Mean   :  3.959  
##                     3rd Qu.:12.00   3rd Qu.: 761.0   3rd Qu.:  3.000  
##                     Max.   :23.00   Max.   :8474.0   Max.   :128.000  
##                                                                       
##    num_videos          shares              news_channel    day_published 
##  Min.   : 0.0000   Min.   :     1   Business     :6258   Monday   :5761  
##  1st Qu.: 0.0000   1st Qu.:   930   Entertainment:7057   Tuesday  :6279  
##  Median : 0.0000   Median :  1400   Lifestyle    :2099   Wednesday:6352  
##  Mean   : 0.9984   Mean   :  2929   Technology   :7346   Thursday :6165  
##  3rd Qu.: 1.0000   3rd Qu.:  2500   World        :8427   Friday   :4735  
##  Max.   :75.0000   Max.   :690400   Social Media :2323   Saturday :2029  
##                                                          Sunday   :2189  
##  date_published                year_published
##  Min.   :2013-01-07 00:00:00   Min.   :2013  
##  1st Qu.:2013-07-16 00:00:00   1st Qu.:2013  
##  Median :2014-02-06 00:00:00   Median :2014  
##  Mean   :2014-01-19 13:35:05   Mean   :2014  
##  3rd Qu.:2014-07-27 00:00:00   3rd Qu.:2014  
##  Max.   :2014-12-27 00:00:00   Max.   :2014  
## 
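The final data frame has 33,510 observations, down from 39,644 in the raw data, because complete.cases() drops every row where a derived variable is NA. A rough sketch of how one might count such rows before dropping them, using an invented toy data frame rather than the real dataset:

```r
# Toy data: row 3 has no channel indicator set (values are invented)
toy <- data.frame(
  shares               = c(593, 711, 1500, 1200),
  data_channel_is_bus  = c(1, 0, 0, 0),
  data_channel_is_tech = c(0, 1, 0, 1)
)

# Count how many indicators are set in each row
indicator_cols <- grep("^data_channel_is_", names(toy), value = TRUE)
no_channel <- rowSums(toy[indicator_cols]) == 0

# Derive the categorical variable; rows with no indicator stay NA
toy$news_channel <- NA_character_
toy$news_channel[toy$data_channel_is_bus == 1]  <- "Business"
toy$news_channel[toy$data_channel_is_tech == 1] <- "Technology"

# complete.cases() removes exactly the rows left as NA
kept <- toy[complete.cases(toy), ]
```

Here sum(no_channel) equals the number of rows that complete.cases() removes, which gives a quick consistency check on the filtering step.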

Final Data Frame

The following lists the variables derived or retained in the final data frame.

Data Dictionary

Variable Name     Description
url               URL of the article
n_tokens_title    Number of words in the title
n_tokens_content  Number of words in the content
num_imgs          Number of images
num_videos        Number of videos
shares            Number of shares (target)
news_channel      Factor: Business, Entertainment, Lifestyle, Technology, World, Social Media
day_published     Factor: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday
year_published    Year of article’s publication: 2013 or 2014
date_published    Date of article’s publication
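The date_published and year_published variables are taken from fixed character positions in the URL, which works because every URL starts with the same 20-character prefix. As an alternative sketch in base R, assuming only that the URL contains a yyyy/mm/dd segment, a regular expression can locate the date instead. The second URL below is made up for illustration:

```r
urls <- c("http://mashable.com/2013/01/07/amazon-instant-video-browser/",
          "http://mashable.com/2014/12/27/hypothetical-article-slug/")

# Find the first yyyy/mm/dd pattern instead of relying on character positions.
# Note: regmatches() with regexpr() silently drops URLs that have no match.
date_text <- regmatches(urls, regexpr("[0-9]{4}/[0-9]{2}/[0-9]{2}", urls))

# Parse the text into Date values and pull out the year
date_published <- as.Date(date_text, format = "%Y/%m/%d")
year_published <- as.numeric(format(date_published, "%Y"))
```

For these URLs the regular expression picks out the same text as substr(urls, 21, 30), but it would keep working if the prefix length changed.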

Subsetting by Year

If we want to create data frames by year, we can subset the data using year_published:

online_news_df.2013 <- subset(online_news_df, year_published == 2013)
online_news_df.2014 <- subset(online_news_df, year_published == 2014)

# show summary statistics for 2013
summary(online_news_df.2013)
##      url            n_tokens_title   n_tokens_content    num_imgs      
##  Length:15192       Min.   : 2.000   Min.   :   0.0   Min.   :  0.000  
##  Class :character   1st Qu.: 9.000   1st Qu.: 247.0   1st Qu.:  1.000  
##  Mode  :character   Median :10.000   Median : 408.5   Median :  1.000  
##                     Mean   : 9.895   Mean   : 542.8   Mean   :  3.531  
##                     3rd Qu.:11.000   3rd Qu.: 696.0   3rd Qu.:  1.000  
##                     Max.   :18.000   Max.   :6505.0   Max.   :101.000  
##                                                                        
##    num_videos          shares              news_channel    day_published 
##  Min.   : 0.0000   Min.   :     1   Business     :3194   Monday   :2586  
##  1st Qu.: 0.0000   1st Qu.:   914   Entertainment:2862   Tuesday  :2860  
##  Median : 0.0000   Median :  1400   Lifestyle    :1191   Wednesday:2952  
##  Mean   : 0.8237   Mean   :  3036   Technology   :3942   Thursday :2801  
##  3rd Qu.: 1.0000   3rd Qu.:  2700   World        :2634   Friday   :2135  
##  Max.   :75.0000   Max.   :690400   Social Media :1369   Saturday : 908  
##                                                          Sunday   : 950  
##  date_published                year_published
##  Min.   :2013-01-07 00:00:00   Min.   :2013  
##  1st Qu.:2013-03-29 00:00:00   1st Qu.:2013  
##  Median :2013-06-26 00:00:00   Median :2013  
##  Mean   :2013-06-28 09:05:46   Mean   :2013  
##  3rd Qu.:2013-09-25 00:00:00   3rd Qu.:2013  
##  Max.   :2013-12-31 00:00:00   Max.   :2013  
## 
# show summary statistics for 2014
summary(online_news_df.2014)
##      url            n_tokens_title  n_tokens_content    num_imgs      
##  Length:18318       Min.   : 3.00   Min.   :   0.0   Min.   :  0.000  
##  Class :character   1st Qu.: 9.00   1st Qu.: 294.0   1st Qu.:  1.000  
##  Mode  :character   Median :11.00   Median : 482.0   Median :  1.000  
##                     Mean   :10.85   Mean   : 620.8   Mean   :  4.315  
##                     3rd Qu.:12.00   3rd Qu.: 814.0   3rd Qu.:  3.000  
##                     Max.   :23.00   Max.   :8474.0   Max.   :128.000  
##                                                                       
##    num_videos         shares                news_channel    day_published 
##  Min.   : 0.000   Min.   :     5.0   Business     :3064   Monday   :3175  
##  1st Qu.: 0.000   1st Qu.:   940.2   Entertainment:4195   Tuesday  :3419  
##  Median : 0.000   Median :  1400.0   Lifestyle    : 908   Wednesday:3400  
##  Mean   : 1.143   Mean   :  2839.5   Technology   :3404   Thursday :3364  
##  3rd Qu.: 1.000   3rd Qu.:  2400.0   World        :5793   Friday   :2600  
##  Max.   :75.000   Max.   :663600.0   Social Media : 954   Saturday :1121  
##                                                           Sunday   :1239  
##  date_published                year_published
##  Min.   :2014-01-01 00:00:00   Min.   :2014  
##  1st Qu.:2014-04-15 00:00:00   1st Qu.:2014  
##  Median :2014-07-14 00:00:00   Median :2014  
##  Mean   :2014-07-08 17:42:02   Mean   :2014  
##  3rd Qu.:2014-10-06 00:00:00   3rd Qu.:2014  
##  Max.   :2014-12-27 00:00:00   Max.   :2014  
##
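
As an alternative to one subset() call per year, split() returns a list of data frames keyed by year. A brief sketch on an invented toy data frame (the column names match the final data frame, the values do not):

```r
# Toy data with the same column names as the final data frame
toy <- data.frame(
  shares         = c(593, 711, 1500, 1200, 505),
  year_published = c(2013, 2013, 2014, 2014, 2014)
)

# One data frame per distinct year, stored in a named list
by_year <- split(toy, toy$year_published)

# Apply the same statistic to each year's subset
median_shares <- sapply(by_year, function(d) median(d$shares))
```

Each subset is then available as by_year[["2013"]] or by_year[["2014"]], and further years would be handled without extra code.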