DATA 607 Week 2 Assignment - Subsetting Datasets

Introduction

This week 2 assignment for DATA 607 will subset the data provided by the UCI OnlineNewsPopularity dataset located here:

https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity

The actual dataset can be downloaded here:

https://archive.ics.uci.edu/ml/machine-learning-databases/00332/OnlineNewsPopularity.zip

About the Data

The authors of this study collected over 39,000 articles from the Mashable website as a base set of data to perform predictive analytics using a novel and proactive Intelligent Decision Support System (IDSS) that analyzes articles prior to their publication. A broad set of features was extracted from the articles, such as keywords, digital content, and other early indicators of popularity.

The OnlineNewsPopularity dataset summarizes a set of features and statistics about articles published by Mashable (www.mashable.com) over a period of two years – 2013 and 2014. The goal is to predict the number of shares in social networks as a means of assessing the popularity of the article.

Loading and Preparing the UCI OnlineNewsPopularity Dataset

The processing below specifically requires the lubridate package. The dataset will be downloaded and unzipped within the current working directory.

Load the data from the UCI Machine Learning Repository

## load lubridate, which provides ymd() for the date parsing further below
library(lubridate)

## download the OnlineNewsPopularity dataset 
## dataset is in a zip file
## assumes that the working directory has been set

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00332/OnlineNewsPopularity.zip"

## local name for the downloaded archive (named zip_file so it does not shadow download.file())
zip_file <- "OnlineNewsPopularity.zip"

if ( ! file.exists(zip_file)) {
    download.file(url, zip_file, mode="wb")
    unzip(zip_file)
}

## load the dataset into a dataframe
## the dataset is in csv format with column headers
online_news <- read.csv("./OnlineNewsPopularity/OnlineNewsPopularity.csv", header=TRUE, stringsAsFactors = FALSE)

Let’s look at the OnlineNewsPopularity dataset from the UCI Machine Learning Repository

str(online_news)

## 'data.frame':    39644 obs. of  61 variables:
##  $ url                          : chr  "http://mashable.com/2013/01/07/amazon-instant-video-browser/" "http://mashable.com/2013/01/07/ap-samsung-sponsored-tweets/" "http://mashable.com/2013/01/07/apple-40-billion-app-downloads/" "http://mashable.com/2013/01/07/astronaut-notre-dame-bcs/" ...
##  $ timedelta                    : num  731 731 731 731 731 731 731 731 731 731 ...
##  $ n_tokens_title               : num  12 9 9 9 13 10 8 12 11 10 ...
##  $ n_tokens_content             : num  219 255 211 531 1072 ...
##  $ n_unique_tokens              : num  0.664 0.605 0.575 0.504 0.416 ...
##  $ n_non_stop_words             : num  1 1 1 1 1 ...
##  $ n_non_stop_unique_tokens     : num  0.815 0.792 0.664 0.666 0.541 ...
##  $ num_hrefs                    : num  4 3 3 9 19 2 21 20 2 4 ...
##  $ num_self_hrefs               : num  2 1 1 0 19 2 20 20 0 1 ...
##  $ num_imgs                     : num  1 1 1 1 20 0 20 20 0 1 ...
##  $ num_videos                   : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ average_token_length         : num  4.68 4.91 4.39 4.4 4.68 ...
##  $ num_keywords                 : num  5 4 6 7 7 9 10 9 7 5 ...
##  $ data_channel_is_lifestyle    : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ data_channel_is_entertainment: num  1 0 0 1 0 0 0 0 0 0 ...
##  $ data_channel_is_bus          : num  0 1 1 0 0 0 0 0 0 0 ...
##  $ data_channel_is_socmed       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ data_channel_is_tech         : num  0 0 0 0 1 1 0 1 1 0 ...
##  $ data_channel_is_world        : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ kw_min_min                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_max_min                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_avg_min                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_min_max                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_max_max                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_avg_max                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_min_avg                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_max_avg                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_avg_avg                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ self_reference_min_shares    : num  496 0 918 0 545 8500 545 545 0 0 ...
##  $ self_reference_max_shares    : num  496 0 918 0 16000 8500 16000 16000 0 0 ...
##  $ self_reference_avg_sharess   : num  496 0 918 0 3151 ...
##  $ weekday_is_monday            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ weekday_is_tuesday           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_wednesday         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_thursday          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_friday            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_saturday          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_sunday            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ is_weekend                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LDA_00                       : num  0.5003 0.7998 0.2178 0.0286 0.0286 ...
##  $ LDA_01                       : num  0.3783 0.05 0.0333 0.4193 0.0288 ...
##  $ LDA_02                       : num  0.04 0.0501 0.0334 0.4947 0.0286 ...
##  $ LDA_03                       : num  0.0413 0.0501 0.0333 0.0289 0.0286 ...
##  $ LDA_04                       : num  0.0401 0.05 0.6822 0.0286 0.8854 ...
##  $ global_subjectivity          : num  0.522 0.341 0.702 0.43 0.514 ...
##  $ global_sentiment_polarity    : num  0.0926 0.1489 0.3233 0.1007 0.281 ...
##  $ global_rate_positive_words   : num  0.0457 0.0431 0.0569 0.0414 0.0746 ...
##  $ global_rate_negative_words   : num  0.0137 0.01569 0.00948 0.02072 0.01213 ...
##  $ rate_positive_words          : num  0.769 0.733 0.857 0.667 0.86 ...
##  $ rate_negative_words          : num  0.231 0.267 0.143 0.333 0.14 ...
##  $ avg_positive_polarity        : num  0.379 0.287 0.496 0.386 0.411 ...
##  $ min_positive_polarity        : num  0.1 0.0333 0.1 0.1364 0.0333 ...
##  $ max_positive_polarity        : num  0.7 0.7 1 0.8 1 0.6 1 1 0.8 0.5 ...
##  $ avg_negative_polarity        : num  -0.35 -0.119 -0.467 -0.37 -0.22 ...
##  $ min_negative_polarity        : num  -0.6 -0.125 -0.8 -0.6 -0.5 -0.4 -0.5 -0.5 -0.125 -0.5 ...
##  $ max_negative_polarity        : num  -0.2 -0.1 -0.133 -0.167 -0.05 ...
##  $ title_subjectivity           : num  0.5 0 0 0 0.455 ...
##  $ title_sentiment_polarity     : num  -0.188 0 0 0 0.136 ...
##  $ abs_title_subjectivity       : num  0 0.5 0.5 0.5 0.0455 ...
##  $ abs_title_sentiment_polarity : num  0.188 0 0 0 0.136 ...
##  $ shares                       : int  593 711 1500 1200 505 855 556 891 3600 710 ...

Data Dictionary

The following lists the predictor or explanatory variables being considered, as provided by the creators of the dataset:

Variable Name	Description
`url`	URL of the article
`timedelta`	Days between the article publication and the dataset acquisition
`n_tokens_title`	Number of words in the title
`n_tokens_content`	Number of words in the content
`n_unique_tokens`	Rate of unique words in the content
`n_non_stop_words`	Rate of non-stop words in the content
`n_non_stop_unique_tokens`	Rate of unique non-stop words in the content
`num_hrefs`	Number of links
`num_self_hrefs`	Number of links to other articles published by Mashable
`num_imgs`	Number of images
`num_videos`	Number of videos
`average_token_length`	Average length of the words in the content
`num_keywords`	Number of keywords in the metadata
`data_channel_is_lifestyle`	Is data channel ‘Lifestyle’?
`data_channel_is_entertainment`	Is data channel ‘Entertainment’?
`data_channel_is_bus`	Is data channel ‘Business’?
`data_channel_is_socmed`	Is data channel ‘Social Media’?
`data_channel_is_tech`	Is data channel ‘Tech’?
`data_channel_is_world`	Is data channel ‘World’?
`kw_min_min`	Worst keyword (min. shares)
`kw_max_min`	Worst keyword (max. shares)
`kw_avg_min`	Worst keyword (avg. shares)
`kw_min_max`	Best keyword (min. shares)
`kw_max_max`	Best keyword (max. shares)
`kw_avg_max`	Best keyword (avg. shares)
`kw_min_avg`	Avg. keyword (min. shares)
`kw_max_avg`	Avg. keyword (max. shares)
`kw_avg_avg`	Avg. keyword (avg. shares)
`self_reference_min_shares`	Min. shares of referenced articles in Mashable
`self_reference_max_shares`	Max. shares of referenced articles in Mashable
`self_reference_avg_sharess`	Avg. shares of referenced articles in Mashable
`weekday_is_monday`	Was the article published on a Monday?
`weekday_is_tuesday`	Was the article published on a Tuesday?
`weekday_is_wednesday`	Was the article published on a Wednesday?
`weekday_is_thursday`	Was the article published on a Thursday?
`weekday_is_friday`	Was the article published on a Friday?
`weekday_is_saturday`	Was the article published on a Saturday?
`weekday_is_sunday`	Was the article published on a Sunday?
`is_weekend`	Was the article published on the weekend?
`LDA_00`	Closeness to LDA topic 0
`LDA_01`	Closeness to LDA topic 1
`LDA_02`	Closeness to LDA topic 2
`LDA_03`	Closeness to LDA topic 3
`LDA_04`	Closeness to LDA topic 4
`global_subjectivity`	Text subjectivity
`global_sentiment_polarity`	Text sentiment polarity
`global_rate_positive_words`	Rate of positive words in the content
`global_rate_negative_words`	Rate of negative words in the content
`rate_positive_words`	Rate of positive words among non-neutral tokens
`rate_negative_words`	Rate of negative words among non-neutral tokens
`avg_positive_polarity`	Avg. polarity of positive words
`min_positive_polarity`	Min. polarity of positive words
`max_positive_polarity`	Max. polarity of positive words
`avg_negative_polarity`	Avg. polarity of negative words
`min_negative_polarity`	Min. polarity of negative words
`max_negative_polarity`	Max. polarity of negative words
`title_subjectivity`	Title subjectivity
`title_sentiment_polarity`	Title polarity
`abs_title_subjectivity`	Absolute subjectivity level
`abs_title_sentiment_polarity`	Absolute polarity level
`shares`	Number of shares (target)

Transformations applied to the dataset

1.) A new categorical variable called news_channel will be created with the values Lifestyle, Entertainment, Business, Social Media, Technology, and World. These values will be derived from the following data channel indicator variables in the dataset:

data_channel_is_lifestyle
data_channel_is_entertainment
data_channel_is_bus
data_channel_is_socmed
data_channel_is_tech
data_channel_is_world

2.) A new categorical variable called day_published will be created to indicate the day of the week the news article was published. The day of the week value will be derived from the following weekday indicator variables in the dataset:

weekday_is_monday
weekday_is_tuesday
weekday_is_wednesday
weekday_is_thursday
weekday_is_friday
weekday_is_saturday
weekday_is_sunday

3.) The date of publication and the publication year will be derived as separate variables from the URL

4.) Variables not being considered will be removed from the final dataframe
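
Steps 1 and 2 both collapse a set of 0/1 indicator columns into a single categorical variable. As a minimal sketch of that idea on toy data (the column names `is_a`, `is_b`, `is_c` and the labels are hypothetical, not part of the dataset), the position of the indicator set to 1 can be looked up per row, leaving NA where no indicator is set:

```r
# Toy data: three hypothetical indicator columns; the last row has no indicator set
toy <- data.frame(
  is_a = c(1, 0, 0, 0),
  is_b = c(0, 1, 0, 0),
  is_c = c(0, 0, 1, 0)
)
labels <- c("A", "B", "C")

# For each row, find the column holding the 1; rows without exactly one
# indicator set are mapped to NA instead of being assigned a label
m <- as.matrix(toy)
idx <- ifelse(rowSums(m) == 1, max.col(m, ties.method = "first"), NA)

# Build the categorical variable from the looked-up labels
toy$category <- factor(labels[idx], levels = labels)
print(toy$category)
```

Rows that end up as NA here correspond to the articles that are later dropped by the complete-case filter.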

Process the dataset into the final data frame

## keep only the specific variables needed for this analysis
keepvars <- c("url", 
              "n_tokens_title", 
              "n_tokens_content",
              "num_imgs",
              "num_videos",
              "data_channel_is_lifestyle",
              "data_channel_is_entertainment",
              "data_channel_is_bus",
              "data_channel_is_socmed",
              "data_channel_is_tech",    
              "data_channel_is_world",
              "weekday_is_monday",   
              "weekday_is_tuesday",  
              "weekday_is_wednesday",    
              "weekday_is_thursday",     
              "weekday_is_friday",   
              "weekday_is_saturday",     
              "weekday_is_sunday",
              "shares")

# remove the variables not considered in the analysis from the data frame

online_news_df <- online_news[keepvars]


# convert the dummy variables to categorical variables for the data channel

online_news_df$news_channel <- NA 
online_news_df$news_channel[online_news_df$data_channel_is_lifestyle==1] <- "Lifestyle"
online_news_df$news_channel[online_news_df$data_channel_is_entertainment==1] <- "Entertainment"
online_news_df$news_channel[online_news_df$data_channel_is_bus==1] <- "Business"
online_news_df$news_channel[online_news_df$data_channel_is_socmed==1] <- "Social Media"
online_news_df$news_channel[online_news_df$data_channel_is_tech==1] <- "Technology"
online_news_df$news_channel[online_news_df$data_channel_is_world==1] <- "World"

# Create the News Channel variable 
online_news_df$news_channel <-  factor(online_news_df$news_channel, 
                                       levels = c("Business", 
                                                  "Entertainment", 
                                                  "Lifestyle", 
                                                  "Technology", 
                                                  "World",
                                                  "Social Media"))

# convert the dummy variables to categorical variables for the day of the week the article was published

online_news_df$day_published <- NA
online_news_df$day_published [online_news_df$weekday_is_monday==1] <- "Monday"
online_news_df$day_published [online_news_df$weekday_is_tuesday==1] <- "Tuesday"
online_news_df$day_published [online_news_df$weekday_is_wednesday==1] <- "Wednesday"
online_news_df$day_published [online_news_df$weekday_is_thursday==1] <- "Thursday"
online_news_df$day_published [online_news_df$weekday_is_friday==1] <- "Friday"
online_news_df$day_published [online_news_df$weekday_is_saturday==1] <- "Saturday"
online_news_df$day_published [online_news_df$weekday_is_sunday==1] <- "Sunday"

# create the variable for the day, date, and year of publication

online_news_df$day_published <- factor(online_news_df$day_published, 
                                       levels = c( "Monday", "Tuesday", "Wednesday", "Thursday",
                                                   "Friday", "Saturday", "Sunday"))

online_news_df$date_published <- ymd(substr(online_news_df$url, 21, 30))
online_news_df$year_published <- as.numeric(substr(online_news_df$url, 21, 24))


## drop unused variables

removevars <- c("data_channel_is_lifestyle",
                "data_channel_is_entertainment",
                "data_channel_is_bus",
                "data_channel_is_socmed",
                "data_channel_is_tech",  
                "data_channel_is_world",
                "weekday_is_monday",     
                "weekday_is_tuesday",    
                "weekday_is_wednesday",  
                "weekday_is_thursday",   
                "weekday_is_friday",     
                "weekday_is_saturday",   
                "weekday_is_sunday")

online_news_df <- online_news_df[, !(colnames(online_news_df) %in% removevars)]

## keep only complete cases within the dataset.  Some day and channel indicators are not valued in the original dataset.

online_news_df <- online_news_df[complete.cases(online_news_df), ]

#remove the original online news dataframe
rm(online_news)

# Look at final Online News Popularity data frame
str(online_news_df)

## 'data.frame':    33510 obs. of  10 variables:
##  $ url             : chr  "http://mashable.com/2013/01/07/amazon-instant-video-browser/" "http://mashable.com/2013/01/07/ap-samsung-sponsored-tweets/" "http://mashable.com/2013/01/07/apple-40-billion-app-downloads/" "http://mashable.com/2013/01/07/astronaut-notre-dame-bcs/" ...
##  $ n_tokens_title  : num  12 9 9 9 13 10 8 12 11 10 ...
##  $ n_tokens_content: num  219 255 211 531 1072 ...
##  $ num_imgs        : num  1 1 1 1 20 0 20 20 0 1 ...
##  $ num_videos      : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ shares          : int  593 711 1500 1200 505 855 556 891 3600 710 ...
##  $ news_channel    : Factor w/ 6 levels "Business","Entertainment",..: 2 1 1 2 4 4 3 4 4 5 ...
##  $ day_published   : Factor w/ 7 levels "Monday","Tuesday",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ date_published  : POSIXct, format: "2013-01-07" "2013-01-07" ...
##  $ year_published  : num  2013 2013 2013 2013 2013 ...

summary(online_news_df)

##      url            n_tokens_title  n_tokens_content    num_imgs      
##  Length:33510       Min.   : 2.00   Min.   :   0.0   Min.   :  0.000  
##  Class :character   1st Qu.: 9.00   1st Qu.: 272.0   1st Qu.:  1.000  
##  Mode  :character   Median :10.00   Median : 447.0   Median :  1.000  
##                     Mean   :10.42   Mean   : 585.4   Mean   :  3.959  
##                     3rd Qu.:12.00   3rd Qu.: 761.0   3rd Qu.:  3.000  
##                     Max.   :23.00   Max.   :8474.0   Max.   :128.000  
##                                                                       
##    num_videos          shares              news_channel    day_published 
##  Min.   : 0.0000   Min.   :     1   Business     :6258   Monday   :5761  
##  1st Qu.: 0.0000   1st Qu.:   930   Entertainment:7057   Tuesday  :6279  
##  Median : 0.0000   Median :  1400   Lifestyle    :2099   Wednesday:6352  
##  Mean   : 0.9984   Mean   :  2929   Technology   :7346   Thursday :6165  
##  3rd Qu.: 1.0000   3rd Qu.:  2500   World        :8427   Friday   :4735  
##  Max.   :75.0000   Max.   :690400   Social Media :2323   Saturday :2029  
##                                                          Sunday   :2189  
##  date_published                year_published
##  Min.   :2013-01-07 00:00:00   Min.   :2013  
##  1st Qu.:2013-07-16 00:00:00   1st Qu.:2013  
##  Median :2014-02-06 00:00:00   Median :2014  
##  Mean   :2014-01-19 13:35:05   Mean   :2014  
##  3rd Qu.:2014-07-27 00:00:00   3rd Qu.:2014  
##  Max.   :2014-12-27 00:00:00   Max.   :2014  
##
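
The complete-case filter reduces the data from 39,644 to 33,510 observations. Before dropping rows, it can be useful to see which columns contribute the missing values. The following is a minimal sketch on a hypothetical toy data frame (the values are made up for illustration), using only base R:

```r
# Toy data frame with missing values in the derived columns (hypothetical values)
toy <- data.frame(
  shares        = c(593L, 711L, 1500L, 1200L),
  news_channel  = c("Business", NA, "World", "Technology"),
  day_published = c("Monday", "Monday", NA, "Monday"),
  stringsAsFactors = FALSE
)

# Count the missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only the rows with no missing values in any column
kept <- toy[complete.cases(toy), ]
print(nrow(kept))
```

Running `colSums(is.na(online_news_df))` just before the `complete.cases()` step would show the same breakdown for the real data.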

Final Data Frame

The following lists the variables derived or retained in the final data frame.

Data Dictionary

Variable Name	Description
`url`	URL of the article
`n_tokens_title`	Number of words in the title
`n_tokens_content`	Number of words in the content
`num_imgs`	Number of images
`num_videos`	Number of videos
`shares`	Number of shares (target)
`news_channel`	Factor: Business, Entertainment, Lifestyle, Technology, World, Social Media
`day_published`	Factor: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday
`year_published`	Year of article’s publication: 2013 or 2014
`date_published`	Date of article’s publication
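
The date_published and year_published values come from the date embedded in each article URL. The processing code reads it from fixed character positions (21 to 30), which relies on every URL starting with http://mashable.com/. As a hedged alternative sketch, the date can be located with a regular expression in base R, without lubridate. The second URL below is hypothetical and included only for illustration:

```r
# Example URLs: the first appears in the dataset, the second is hypothetical
urls <- c("http://mashable.com/2013/01/07/amazon-instant-video-browser/",
          "http://mashable.com/2014/12/27/some-hypothetical-article/")

# Extract the first yyyy/mm/dd pattern from each URL.
# Note: regmatches() silently drops URLs that do not match the pattern.
date_str <- regmatches(urls, regexpr("[0-9]{4}/[0-9]{2}/[0-9]{2}", urls))

# Convert to Date and derive the year from it
date_published <- as.Date(date_str, format = "%Y/%m/%d")
year_published <- as.integer(format(date_published, "%Y"))

print(date_published)
print(year_published)
```

This version returns a Date rather than the POSIXct shown in the output above, so it is a variation on the approach, not a drop-in replacement.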

Subsetting by Year

If we want to create separate data frames by year, we can subset the data using the year_published variable:

online_news_df.2013 <- subset(online_news_df, year_published == 2013)
online_news_df.2014 <- subset(online_news_df, year_published == 2014)

# show summary statistics for 2013
summary(online_news_df.2013)

##      url            n_tokens_title   n_tokens_content    num_imgs      
##  Length:15192       Min.   : 2.000   Min.   :   0.0   Min.   :  0.000  
##  Class :character   1st Qu.: 9.000   1st Qu.: 247.0   1st Qu.:  1.000  
##  Mode  :character   Median :10.000   Median : 408.5   Median :  1.000  
##                     Mean   : 9.895   Mean   : 542.8   Mean   :  3.531  
##                     3rd Qu.:11.000   3rd Qu.: 696.0   3rd Qu.:  1.000  
##                     Max.   :18.000   Max.   :6505.0   Max.   :101.000  
##                                                                        
##    num_videos          shares              news_channel    day_published 
##  Min.   : 0.0000   Min.   :     1   Business     :3194   Monday   :2586  
##  1st Qu.: 0.0000   1st Qu.:   914   Entertainment:2862   Tuesday  :2860  
##  Median : 0.0000   Median :  1400   Lifestyle    :1191   Wednesday:2952  
##  Mean   : 0.8237   Mean   :  3036   Technology   :3942   Thursday :2801  
##  3rd Qu.: 1.0000   3rd Qu.:  2700   World        :2634   Friday   :2135  
##  Max.   :75.0000   Max.   :690400   Social Media :1369   Saturday : 908  
##                                                          Sunday   : 950  
##  date_published                year_published
##  Min.   :2013-01-07 00:00:00   Min.   :2013  
##  1st Qu.:2013-03-29 00:00:00   1st Qu.:2013  
##  Median :2013-06-26 00:00:00   Median :2013  
##  Mean   :2013-06-28 09:05:46   Mean   :2013  
##  3rd Qu.:2013-09-25 00:00:00   3rd Qu.:2013  
##  Max.   :2013-12-31 00:00:00   Max.   :2013  
##

# show summary statistics for 2014
summary(online_news_df.2014)

##      url            n_tokens_title  n_tokens_content    num_imgs      
##  Length:18318       Min.   : 3.00   Min.   :   0.0   Min.   :  0.000  
##  Class :character   1st Qu.: 9.00   1st Qu.: 294.0   1st Qu.:  1.000  
##  Mode  :character   Median :11.00   Median : 482.0   Median :  1.000  
##                     Mean   :10.85   Mean   : 620.8   Mean   :  4.315  
##                     3rd Qu.:12.00   3rd Qu.: 814.0   3rd Qu.:  3.000  
##                     Max.   :23.00   Max.   :8474.0   Max.   :128.000  
##                                                                       
##    num_videos         shares                news_channel    day_published 
##  Min.   : 0.000   Min.   :     5.0   Business     :3064   Monday   :3175  
##  1st Qu.: 0.000   1st Qu.:   940.2   Entertainment:4195   Tuesday  :3419  
##  Median : 0.000   Median :  1400.0   Lifestyle    : 908   Wednesday:3400  
##  Mean   : 1.143   Mean   :  2839.5   Technology   :3404   Thursday :3364  
##  3rd Qu.: 1.000   3rd Qu.:  2400.0   World        :5793   Friday   :2600  
##  Max.   :75.000   Max.   :663600.0   Social Media : 954   Saturday :1121  
##                                                           Sunday   :1239  
##  date_published                year_published
##  Min.   :2014-01-01 00:00:00   Min.   :2014  
##  1st Qu.:2014-04-15 00:00:00   1st Qu.:2014  
##  Median :2014-07-14 00:00:00   Median :2014  
##  Mean   :2014-07-08 17:42:02   Mean   :2014  
##  3rd Qu.:2014-10-06 00:00:00   3rd Qu.:2014  
##  Max.   :2014-12-27 00:00:00   Max.   :2014  
##
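
When there are more than a couple of groups, writing one subset() call per value becomes repetitive. As a sketch of an alternative (shown on a small hypothetical data frame, not the real data), base R's split() returns a named list with one data frame per year:

```r
# Toy data frame standing in for online_news_df (hypothetical values)
toy <- data.frame(
  year_published = c(2013, 2013, 2014),
  shares         = c(593L, 711L, 1500L)
)

# split() produces a list of data frames, named by the grouping values
by_year <- split(toy, toy$year_published)

print(names(by_year))
print(sapply(by_year, nrow))

# Each element can then be summarized in the same way as a subset() result
summary(by_year[["2013"]]$shares)
```

Applied to the real data, `split(online_news_df, online_news_df$year_published)` would give list elements equivalent to the two data frames created with subset() above.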

Keith Folsom

February 6, 2016
