Introduction

In today's digital era, most of us seek and share information through online news and articles. It is easy to see which articles are the most liked, viewed and shared on the internet and on various social media platforms. A trending article can make its author popular in a matter of hours, and it can help media outlets attract more readers and generate profit. It is therefore important to evaluate which features make an article gain popularity. To this end, this project aims to predict the popularity of an online article using the features described below. Furthermore, we will select the best prediction machine learning model based on Root Mean Square Error (RMSE) and R2.
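As a quick reference, both metrics can be computed directly from a vector of observed values and a vector of model predictions. The sketch below uses small hypothetical vectors (actual, predicted) only to fix the definitions used later when comparing models.

# hypothetical values, purely to illustrate the two evaluation metrics
actual    <- c(1200, 843, 1500, 2100, 505)
predicted <- c(1100, 900, 1700, 1900, 650)

rmse <- sqrt(mean((actual - predicted)^2))      # Root Mean Square Error
r2   <- 1 - sum((actual - predicted)^2) /
            sum((actual - mean(actual))^2)      # R2 (coefficient of determination)
rmse
r2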

Data Set Description and Preparation

This data set summarizes a heterogeneous set of features about articles published by Mashable over a period of two years. The goal is to predict the number of shares in social networks. The data set has 61 attributes: 58 predictive attributes, 2 non-predictive attributes and 1 goal field.

The data set has been retrieved from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Online%2BNews%2BPopularity. We would like to predict the number of shares (variable shares).
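The code in this report relies on a handful of packages: dplyr (for glimpse() and the %>% pipe), ggplot2 (for the histograms), caret (for createDataPartition()) and bestNormalize (for the Box-Cox and Yeo-Johnson transformations). The short setup chunk below assumes these packages are already installed.

# packages used throughout the report (assumed to be installed)
library(dplyr)          # glimpse(), %>%
library(ggplot2)        # histograms
library(caret)          # createDataPartition()
library(bestNormalize)  # boxcox(), yeojohnson()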

Let's take a look at the structure of our data set by applying the glimpse() function.

my_data<-read.csv("Regression.csv")
glimpse(my_data)
## Rows: 39,644
## Columns: 61
## $ url                           <chr> "http://mashable.com/2013/01/07/amazon-i~
## $ timedelta                     <int> 731, 731, 731, 731, 731, 731, 731, 731, ~
## $ n_tokens_title                <int> 12, 9, 9, 9, 13, 10, 8, 12, 11, 10, 9, 1~
## $ n_tokens_content              <int> 219, 255, 211, 531, 1072, 370, 960, 989,~
## $ n_unique_tokens               <dbl> 0.6635945, 0.6047431, 0.5751295, 0.50378~
## $ n_non_stop_words              <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ n_non_stop_unique_tokens      <dbl> 0.8153846, 0.7919463, 0.6638655, 0.66563~
## $ num_hrefs                     <int> 4, 3, 3, 9, 19, 2, 21, 20, 2, 4, 11, 7, ~
## $ num_self_hrefs                <int> 2, 1, 1, 0, 19, 2, 20, 20, 0, 1, 0, 0, 2~
## $ num_imgs                      <int> 1, 1, 1, 1, 20, 0, 20, 20, 0, 1, 1, 1, 1~
## $ num_videos                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2~
## $ average_token_length          <dbl> 4.680365, 4.913725, 4.393365, 4.404896, ~
## $ num_keywords                  <int> 5, 4, 6, 7, 7, 9, 10, 9, 7, 5, 8, 7, 8, ~
## $ data_channel_is_lifestyle     <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0~
## $ data_channel_is_entertainment <int> 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ data_channel_is_bus           <int> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ data_channel_is_socmed        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ data_channel_is_tech          <int> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0~
## $ data_channel_is_world         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0~
## $ kw_min_min                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_max_min                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_avg_min                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_min_max                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_max_max                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_avg_max                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_min_avg                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_max_avg                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_avg_avg                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ self_reference_min_shares     <dbl> 496, 0, 918, 0, 545, 8500, 545, 545, 0, ~
## $ self_reference_max_shares     <dbl> 496, 0, 918, 0, 16000, 8500, 16000, 1600~
## $ self_reference_avg_sharess    <dbl> 496.000, 0.000, 918.000, 0.000, 3151.158~
## $ weekday_is_monday             <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ weekday_is_tuesday            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ weekday_is_wednesday          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ weekday_is_thursday           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ weekday_is_friday             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ weekday_is_saturday           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ weekday_is_sunday             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ is_weekend                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ LDA_00                        <dbl> 0.50033120, 0.79975569, 0.21779229, 0.02~
## $ LDA_01                        <dbl> 0.37827893, 0.05004667, 0.03333446, 0.41~
## $ LDA_02                        <dbl> 0.04000468, 0.05009625, 0.03335142, 0.49~
## $ LDA_03                        <dbl> 0.04126265, 0.05010067, 0.03333354, 0.02~
## $ LDA_04                        <dbl> 0.04012254, 0.05000071, 0.68218829, 0.02~
## $ global_subjectivity           <dbl> 0.5216171, 0.3412458, 0.7022222, 0.42984~
## $ global_sentiment_polarity     <dbl> 0.09256198, 0.14894781, 0.32333333, 0.10~
## $ global_rate_positive_words    <dbl> 0.04566210, 0.04313725, 0.05687204, 0.04~
## $ global_rate_negative_words    <dbl> 0.013698630, 0.015686275, 0.009478673, 0~
## $ rate_positive_words           <dbl> 0.7692308, 0.7333333, 0.8571429, 0.66666~
## $ rate_negative_words           <dbl> 0.2307692, 0.2666667, 0.1428571, 0.33333~
## $ avg_positive_polarity         <dbl> 0.3786364, 0.2869146, 0.4958333, 0.38596~
## $ min_positive_polarity         <dbl> 0.10000000, 0.03333333, 0.10000000, 0.13~
## $ max_positive_polarity         <dbl> 0.7000000, 0.7000000, 1.0000000, 0.80000~
## $ avg_negative_polarity         <dbl> -0.3500000, -0.1187500, -0.4666667, -0.3~
## $ min_negative_polarity         <dbl> -0.6000000, -0.1250000, -0.8000000, -0.6~
## $ max_negative_polarity         <dbl> -0.2000000, -0.1000000, -0.1333333, -0.1~
## $ title_subjectivity            <dbl> 0.5000000, 0.0000000, 0.0000000, 0.00000~
## $ title_sentiment_polarity      <dbl> -0.1875000, 0.0000000, 0.0000000, 0.0000~
## $ abs_title_subjectivity        <dbl> 0.00000000, 0.50000000, 0.50000000, 0.50~
## $ abs_title_sentiment_polarity  <dbl> 0.1875000, 0.0000000, 0.0000000, 0.00000~
## $ shares                        <int> 593, 711, 1500, 1200, 505, 855, 556, 891~

In total we have 39,644 observations and 61 features (columns). For estimating regression models we do not need the first two features, url and timedelta, as they are non-predictive. Let's remove them from the data set and save the result as our final data set.

# Removing the first two non-predictive features from our data set.
my_data <- my_data[, c(-1,-2)]

Our data set contains a few binary variables. For modeling purposes, and to make the data easier to read, let's convert these binary variables (0/1) into nominal variables (Yes/No). First we count the distinct values of every column; the columns with exactly two distinct values form our list of binary variables.

# number of distinct values in each column, sorted ascending
a <- sapply(my_data,
            function(x)
              unique(x) %>%
              length()) %>%
  sort()

# columns with exactly two distinct values are treated as binary
binary.list <- names(a[a == 2])

binary.list
##  [1] "data_channel_is_lifestyle"     "data_channel_is_entertainment"
##  [3] "data_channel_is_bus"           "data_channel_is_socmed"       
##  [5] "data_channel_is_tech"          "data_channel_is_world"        
##  [7] "weekday_is_monday"             "weekday_is_tuesday"           
##  [9] "weekday_is_wednesday"          "weekday_is_thursday"          
## [11] "weekday_is_friday"             "weekday_is_saturday"          
## [13] "weekday_is_sunday"             "is_weekend"
# recode every binary column listed in binary.list from 0/1 to "No"/"Yes"
transform01 <- function(my_data, binary.list){
  for (i in 1:length(binary.list)) {
    my_data[, binary.list[i]] <-
      ifelse(my_data[, binary.list[i]] == 1, 'Yes', 'No')
  }
  return(my_data)
}


my_data <- transform01(my_data, binary.list)

glimpse(my_data)
## Rows: 39,644
## Columns: 59
## $ n_tokens_title                <int> 12, 9, 9, 9, 13, 10, 8, 12, 11, 10, 9, 1~
## $ n_tokens_content              <int> 219, 255, 211, 531, 1072, 370, 960, 989,~
## $ n_unique_tokens               <dbl> 0.6635945, 0.6047431, 0.5751295, 0.50378~
## $ n_non_stop_words              <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ n_non_stop_unique_tokens      <dbl> 0.8153846, 0.7919463, 0.6638655, 0.66563~
## $ num_hrefs                     <int> 4, 3, 3, 9, 19, 2, 21, 20, 2, 4, 11, 7, ~
## $ num_self_hrefs                <int> 2, 1, 1, 0, 19, 2, 20, 20, 0, 1, 0, 0, 2~
## $ num_imgs                      <int> 1, 1, 1, 1, 20, 0, 20, 20, 0, 1, 1, 1, 1~
## $ num_videos                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2~
## $ average_token_length          <dbl> 4.680365, 4.913725, 4.393365, 4.404896, ~
## $ num_keywords                  <int> 5, 4, 6, 7, 7, 9, 10, 9, 7, 5, 8, 7, 8, ~
## $ data_channel_is_lifestyle     <chr> "No", "No", "No", "No", "No", "No", "Yes~
## $ data_channel_is_entertainment <chr> "Yes", "No", "No", "Yes", "No", "No", "N~
## $ data_channel_is_bus           <chr> "No", "Yes", "Yes", "No", "No", "No", "N~
## $ data_channel_is_socmed        <chr> "No", "No", "No", "No", "No", "No", "No"~
## $ data_channel_is_tech          <chr> "No", "No", "No", "No", "Yes", "Yes", "N~
## $ data_channel_is_world         <chr> "No", "No", "No", "No", "No", "No", "No"~
## $ kw_min_min                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_max_min                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_avg_min                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_min_max                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_max_max                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_avg_max                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_min_avg                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_max_avg                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_avg_avg                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ self_reference_min_shares     <dbl> 496, 0, 918, 0, 545, 8500, 545, 545, 0, ~
## $ self_reference_max_shares     <dbl> 496, 0, 918, 0, 16000, 8500, 16000, 1600~
## $ self_reference_avg_sharess    <dbl> 496.000, 0.000, 918.000, 0.000, 3151.158~
## $ weekday_is_monday             <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"~
## $ weekday_is_tuesday            <chr> "No", "No", "No", "No", "No", "No", "No"~
## $ weekday_is_wednesday          <chr> "No", "No", "No", "No", "No", "No", "No"~
## $ weekday_is_thursday           <chr> "No", "No", "No", "No", "No", "No", "No"~
## $ weekday_is_friday             <chr> "No", "No", "No", "No", "No", "No", "No"~
## $ weekday_is_saturday           <chr> "No", "No", "No", "No", "No", "No", "No"~
## $ weekday_is_sunday             <chr> "No", "No", "No", "No", "No", "No", "No"~
## $ is_weekend                    <chr> "No", "No", "No", "No", "No", "No", "No"~
## $ LDA_00                        <dbl> 0.50033120, 0.79975569, 0.21779229, 0.02~
## $ LDA_01                        <dbl> 0.37827893, 0.05004667, 0.03333446, 0.41~
## $ LDA_02                        <dbl> 0.04000468, 0.05009625, 0.03335142, 0.49~
## $ LDA_03                        <dbl> 0.04126265, 0.05010067, 0.03333354, 0.02~
## $ LDA_04                        <dbl> 0.04012254, 0.05000071, 0.68218829, 0.02~
## $ global_subjectivity           <dbl> 0.5216171, 0.3412458, 0.7022222, 0.42984~
## $ global_sentiment_polarity     <dbl> 0.09256198, 0.14894781, 0.32333333, 0.10~
## $ global_rate_positive_words    <dbl> 0.04566210, 0.04313725, 0.05687204, 0.04~
## $ global_rate_negative_words    <dbl> 0.013698630, 0.015686275, 0.009478673, 0~
## $ rate_positive_words           <dbl> 0.7692308, 0.7333333, 0.8571429, 0.66666~
## $ rate_negative_words           <dbl> 0.2307692, 0.2666667, 0.1428571, 0.33333~
## $ avg_positive_polarity         <dbl> 0.3786364, 0.2869146, 0.4958333, 0.38596~
## $ min_positive_polarity         <dbl> 0.10000000, 0.03333333, 0.10000000, 0.13~
## $ max_positive_polarity         <dbl> 0.7000000, 0.7000000, 1.0000000, 0.80000~
## $ avg_negative_polarity         <dbl> -0.3500000, -0.1187500, -0.4666667, -0.3~
## $ min_negative_polarity         <dbl> -0.6000000, -0.1250000, -0.8000000, -0.6~
## $ max_negative_polarity         <dbl> -0.2000000, -0.1000000, -0.1333333, -0.1~
## $ title_subjectivity            <dbl> 0.5000000, 0.0000000, 0.0000000, 0.00000~
## $ title_sentiment_polarity      <dbl> -0.1875000, 0.0000000, 0.0000000, 0.0000~
## $ abs_title_subjectivity        <dbl> 0.00000000, 0.50000000, 0.50000000, 0.50~
## $ abs_title_sentiment_polarity  <dbl> 0.1875000, 0.0000000, 0.0000000, 0.00000~
## $ shares                        <int> 593, 711, 1500, 1200, 505, 855, 556, 891~

Before applying various feature selection methods, it is a good idea to quickly go through all columns and check whether any of them contain missing values.

colSums(is.na(my_data)) %>% sort() 
##                n_tokens_title              n_tokens_content 
##                             0                             0 
##               n_unique_tokens              n_non_stop_words 
##                             0                             0 
##      n_non_stop_unique_tokens                     num_hrefs 
##                             0                             0 
##                num_self_hrefs                      num_imgs 
##                             0                             0 
##                    num_videos          average_token_length 
##                             0                             0 
##                  num_keywords     data_channel_is_lifestyle 
##                             0                             0 
## data_channel_is_entertainment           data_channel_is_bus 
##                             0                             0 
##        data_channel_is_socmed          data_channel_is_tech 
##                             0                             0 
##         data_channel_is_world                    kw_min_min 
##                             0                             0 
##                    kw_max_min                    kw_avg_min 
##                             0                             0 
##                    kw_min_max                    kw_max_max 
##                             0                             0 
##                    kw_avg_max                    kw_min_avg 
##                             0                             0 
##                    kw_max_avg                    kw_avg_avg 
##                             0                             0 
##     self_reference_min_shares     self_reference_max_shares 
##                             0                             0 
##    self_reference_avg_sharess             weekday_is_monday 
##                             0                             0 
##            weekday_is_tuesday          weekday_is_wednesday 
##                             0                             0 
##           weekday_is_thursday             weekday_is_friday 
##                             0                             0 
##           weekday_is_saturday             weekday_is_sunday 
##                             0                             0 
##                    is_weekend                        LDA_00 
##                             0                             0 
##                        LDA_01                        LDA_02 
##                             0                             0 
##                        LDA_03                        LDA_04 
##                             0                             0 
##           global_subjectivity     global_sentiment_polarity 
##                             0                             0 
##    global_rate_positive_words    global_rate_negative_words 
##                             0                             0 
##           rate_positive_words           rate_negative_words 
##                             0                             0 
##         avg_positive_polarity         min_positive_polarity 
##                             0                             0 
##         max_positive_polarity         avg_negative_polarity 
##                             0                             0 
##         min_negative_polarity         max_negative_polarity 
##                             0                             0 
##            title_subjectivity      title_sentiment_polarity 
##                             0                             0 
##        abs_title_subjectivity  abs_title_sentiment_polarity 
##                             0                             0 
##                        shares 
##                             0
# no missing values in any column

Let's make a list of the categorical (character) variables and convert them to factor variables. We also store the names of the numeric features in a separate vector named shares_numerical.

shares_categorical <- sapply(my_data, is.character) %>% which() %>% names()

#converting all categorical variables to factors by applying a loop

for (variable in shares_categorical) {
  my_data[[variable]] <- as.factor(my_data[[variable]])
}

#list of numeric features in a separate vector 

shares_numerical <- sapply(my_data, is.numeric) %>% which() %>% names()

glimpse(my_data)
## Rows: 39,644
## Columns: 59
## $ n_tokens_title                <int> 12, 9, 9, 9, 13, 10, 8, 12, 11, 10, 9, 1~
## $ n_tokens_content              <int> 219, 255, 211, 531, 1072, 370, 960, 989,~
## $ n_unique_tokens               <dbl> 0.6635945, 0.6047431, 0.5751295, 0.50378~
## $ n_non_stop_words              <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ n_non_stop_unique_tokens      <dbl> 0.8153846, 0.7919463, 0.6638655, 0.66563~
## $ num_hrefs                     <int> 4, 3, 3, 9, 19, 2, 21, 20, 2, 4, 11, 7, ~
## $ num_self_hrefs                <int> 2, 1, 1, 0, 19, 2, 20, 20, 0, 1, 0, 0, 2~
## $ num_imgs                      <int> 1, 1, 1, 1, 20, 0, 20, 20, 0, 1, 1, 1, 1~
## $ num_videos                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2~
## $ average_token_length          <dbl> 4.680365, 4.913725, 4.393365, 4.404896, ~
## $ num_keywords                  <int> 5, 4, 6, 7, 7, 9, 10, 9, 7, 5, 8, 7, 8, ~
## $ data_channel_is_lifestyle     <fct> No, No, No, No, No, No, Yes, No, No, No,~
## $ data_channel_is_entertainment <fct> Yes, No, No, Yes, No, No, No, No, No, No~
## $ data_channel_is_bus           <fct> No, Yes, Yes, No, No, No, No, No, No, No~
## $ data_channel_is_socmed        <fct> No, No, No, No, No, No, No, No, No, No, ~
## $ data_channel_is_tech          <fct> No, No, No, No, Yes, Yes, No, Yes, Yes, ~
## $ data_channel_is_world         <fct> No, No, No, No, No, No, No, No, No, Yes,~
## $ kw_min_min                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_max_min                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_avg_min                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_min_max                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_max_max                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_avg_max                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_min_avg                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_max_avg                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_avg_avg                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ self_reference_min_shares     <dbl> 496, 0, 918, 0, 545, 8500, 545, 545, 0, ~
## $ self_reference_max_shares     <dbl> 496, 0, 918, 0, 16000, 8500, 16000, 1600~
## $ self_reference_avg_sharess    <dbl> 496.000, 0.000, 918.000, 0.000, 3151.158~
## $ weekday_is_monday             <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, ~
## $ weekday_is_tuesday            <fct> No, No, No, No, No, No, No, No, No, No, ~
## $ weekday_is_wednesday          <fct> No, No, No, No, No, No, No, No, No, No, ~
## $ weekday_is_thursday           <fct> No, No, No, No, No, No, No, No, No, No, ~
## $ weekday_is_friday             <fct> No, No, No, No, No, No, No, No, No, No, ~
## $ weekday_is_saturday           <fct> No, No, No, No, No, No, No, No, No, No, ~
## $ weekday_is_sunday             <fct> No, No, No, No, No, No, No, No, No, No, ~
## $ is_weekend                    <fct> No, No, No, No, No, No, No, No, No, No, ~
## $ LDA_00                        <dbl> 0.50033120, 0.79975569, 0.21779229, 0.02~
## $ LDA_01                        <dbl> 0.37827893, 0.05004667, 0.03333446, 0.41~
## $ LDA_02                        <dbl> 0.04000468, 0.05009625, 0.03335142, 0.49~
## $ LDA_03                        <dbl> 0.04126265, 0.05010067, 0.03333354, 0.02~
## $ LDA_04                        <dbl> 0.04012254, 0.05000071, 0.68218829, 0.02~
## $ global_subjectivity           <dbl> 0.5216171, 0.3412458, 0.7022222, 0.42984~
## $ global_sentiment_polarity     <dbl> 0.09256198, 0.14894781, 0.32333333, 0.10~
## $ global_rate_positive_words    <dbl> 0.04566210, 0.04313725, 0.05687204, 0.04~
## $ global_rate_negative_words    <dbl> 0.013698630, 0.015686275, 0.009478673, 0~
## $ rate_positive_words           <dbl> 0.7692308, 0.7333333, 0.8571429, 0.66666~
## $ rate_negative_words           <dbl> 0.2307692, 0.2666667, 0.1428571, 0.33333~
## $ avg_positive_polarity         <dbl> 0.3786364, 0.2869146, 0.4958333, 0.38596~
## $ min_positive_polarity         <dbl> 0.10000000, 0.03333333, 0.10000000, 0.13~
## $ max_positive_polarity         <dbl> 0.7000000, 0.7000000, 1.0000000, 0.80000~
## $ avg_negative_polarity         <dbl> -0.3500000, -0.1187500, -0.4666667, -0.3~
## $ min_negative_polarity         <dbl> -0.6000000, -0.1250000, -0.8000000, -0.6~
## $ max_negative_polarity         <dbl> -0.2000000, -0.1000000, -0.1333333, -0.1~
## $ title_subjectivity            <dbl> 0.5000000, 0.0000000, 0.0000000, 0.00000~
## $ title_sentiment_polarity      <dbl> -0.1875000, 0.0000000, 0.0000000, 0.0000~
## $ abs_title_subjectivity        <dbl> 0.00000000, 0.50000000, 0.50000000, 0.50~
## $ abs_title_sentiment_polarity  <dbl> 0.1875000, 0.0000000, 0.0000000, 0.00000~
## $ shares                        <int> 593, 711, 1500, 1200, 505, 855, 556, 891~
#All the character variables are now converted to factor variables.

Data Partitioning

Let's divide the data into a training and a test sample. Several functions could do the job, but we will use the createDataPartition() function from the caret package.

set.seed(987654321)

data_which_train <- createDataPartition(my_data$shares, # target variable
                                          # share of the training sample
                                          p = 0.7,
                                          # should result be a list?
                                          list = FALSE)

createDataPartition() does not create the training or test sample itself; it returns a vector of row indexes for the training sample (here 70% of the observations). We need to apply this index to split the data, so let's create the division. We then compare the frequency distribution of the target variable, the number of shares (shares), in the two samples.

train_data <- my_data[data_which_train, ]
test_data <- my_data[-data_which_train, ]

summary(train_data$shares)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       4     946    1400    3396    2800  843300
summary(test_data$shares)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1     946    1400    3393    2800  441000

Apart from the minimum and maximum values, the other statistics are very similar in the two samples. It is important to keep in mind that all transformations that depend on variable distributions must be estimated on the training data.
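To make this point concrete, here is a small sketch (using caret's preProcess(), not a step performed later in this report) of estimating a distribution-based transformation on the training sample only and then applying the same parameters to the test sample:

# sketch: estimate centering/scaling parameters on the training data only,
# then apply those same parameters to the test data (avoids data leakage)
num_predictors <- setdiff(shares_numerical, "shares")
pp <- preProcess(train_data[, num_predictors], method = c("center", "scale"))
train_scaled <- predict(pp, train_data[, num_predictors])
test_scaled  <- predict(pp, test_data[, num_predictors])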

Let's check the distribution of our outcome variable, the number of shares (shares).

ggplot(my_data,
       aes(x = shares)) +
  geom_histogram(fill = "blue",
                bins= 100) +
  theme_bw()+
  labs(title = "Histogram of Outcome variable (Shares)") +
  theme(plot.title = element_text(hjust = 0.6))

As we can see, the distribution of our outcome variable is strongly right-skewed. For modeling purposes, a roughly normal distribution is preferred. One way to make it look more normal is to take the logarithm and see whether it helps.

ggplot(my_data,
       aes(x = log(shares))) +
  geom_histogram(fill = "red",
                 bins = 100) +
  theme_bw()

The histogram above shows that the log transformation does help: the distribution looks noticeably more symmetrical than without the transformation.
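We can also quantify this with a simple moment-based skewness measure (a quick sketch; shares is strictly positive, so taking the log is safe):

# sample skewness before and after the log transformation
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
skew(my_data$shares)
skew(log(my_data$shares))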

We can also try the Box-Cox and Yeo-Johnson transformation methods. Both transformations give the same lambda value. However, the distribution after the Box-Cox transformation is still not really symmetrical, so we will stick with the log transformation.

shares_boxcox <- bestNormalize::boxcox(train_data$shares)
shares_boxcox$lambda
## [1] -0.2345715
sharesboxcox <- shares_boxcox$x.t
hist(sharesboxcox)
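For completeness, the Yeo-Johnson fit can be inspected the same way; this is a sketch assuming the yeojohnson() function from the bestNormalize package, whose result exposes the estimated lambda and the transformed values just like the Box-Cox object above.

shares_yeojohnson <- bestNormalize::yeojohnson(train_data$shares)
shares_yeojohnson$lambda
hist(shares_yeojohnson$x.t)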

Initial Feature Selection

Feature selection is a vital part of building machine learning models: including redundant variables, or variables only weakly related to the outcome, adds nothing to the results and can reduce accuracy in classification tasks or increase errors in regression tasks. It is therefore highly recommended to apply feature selection techniques to identify redundant features and remove them. We start by examining the relationship of the predictors with the outcome variable, and we also check for strong relationships among the predictors themselves (collinearity); we must make sure that they are not strongly correlated with each other. Since we have both categorical and numeric features, we handle them separately.
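The next subsection prints the full pairwise correlation matrix of the numeric features. As a side note, such a matrix can also be screened programmatically with caret's findCorrelation(); the sketch below flags columns whose absolute pairwise correlation exceeds 0.8 (an illustrative cutoff, not one used later in the report).

# sketch: list numeric columns with pairwise |correlation| above 0.8
cor_matrix_train <- cor(train_data[, shares_numerical],
                        use = "pairwise.complete.obs")
colnames(cor_matrix_train)[findCorrelation(cor_matrix_train, cutoff = 0.8)]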

Correlation Among Numeric Features

shares_correlation <-
  cor(train_data[,shares_numerical],
      use = "pairwise.complete.obs")

shares_correlation %>% round(digits = 2)
##                              n_tokens_title n_tokens_content n_unique_tokens
## n_tokens_title                         1.00             0.02           -0.01
## n_tokens_content                       0.02             1.00            0.00
## n_unique_tokens                       -0.01             0.00            1.00
## n_non_stop_words                      -0.01             0.02            1.00
## n_non_stop_unique_tokens              -0.01             0.00            1.00
## num_hrefs                             -0.05             0.42            0.00
## num_self_hrefs                        -0.02             0.30            0.01
## num_imgs                              -0.01             0.34            0.03
## num_videos                             0.05             0.10            0.00
## average_token_length                  -0.08             0.16            0.02
## num_keywords                          -0.01             0.08            0.00
## kw_min_min                            -0.11            -0.06            0.00
## kw_max_min                             0.00             0.00            0.00
## kw_avg_min                            -0.03             0.00            0.00
## kw_min_max                             0.01            -0.03            0.00
## kw_max_max                             0.12             0.06            0.00
## kw_avg_max                             0.11            -0.10            0.00
## kw_min_avg                            -0.01            -0.02            0.01
## kw_max_avg                             0.01            -0.03            0.00
## kw_avg_avg                             0.00            -0.08            0.00
## self_reference_min_shares              0.00            -0.03            0.00
## self_reference_max_shares              0.00             0.03            0.00
## self_reference_avg_sharess             0.00            -0.01            0.00
## LDA_00                                -0.07             0.02            0.00
## LDA_01                                 0.06            -0.01            0.00
## LDA_02                                 0.04             0.08           -0.01
## LDA_03                                 0.04            -0.14            0.00
## LDA_04                                -0.06             0.04           -0.01
## global_subjectivity                   -0.06             0.13           -0.01
## global_sentiment_polarity             -0.08             0.02            0.00
## global_rate_positive_words            -0.07             0.13            0.00
## global_rate_negative_words             0.02             0.12            0.00
## rate_positive_words                   -0.07             0.10           -0.01
## rate_negative_words                    0.04             0.10            0.00
## avg_positive_polarity                 -0.05             0.13           -0.01
## min_positive_polarity                 -0.02            -0.26            0.01
## max_positive_polarity                 -0.02             0.41           -0.01
## avg_negative_polarity                 -0.02            -0.13            0.00
## min_negative_polarity                 -0.03            -0.45            0.01
## max_negative_polarity                  0.01             0.23            0.00
## title_subjectivity                     0.08             0.00           -0.01
## title_sentiment_polarity               0.00             0.02            0.00
## abs_title_subjectivity                -0.15             0.01           -0.01
## abs_title_sentiment_polarity           0.04             0.02            0.00
## shares                                 0.01             0.01            0.00
##                              n_non_stop_words n_non_stop_unique_tokens
## n_tokens_title                          -0.01                    -0.01
## n_tokens_content                         0.02                     0.00
## n_unique_tokens                          1.00                     1.00
## n_non_stop_words                         1.00                     1.00
## n_non_stop_unique_tokens                 1.00                     1.00
## num_hrefs                                0.00                     0.00
## num_self_hrefs                           0.01                     0.01
## num_imgs                                 0.03                     0.02
## num_videos                               0.00                     0.00
## average_token_length                     0.03                     0.03
## num_keywords                             0.00                     0.00
## kw_min_min                               0.00                     0.00
## kw_max_min                               0.00                     0.00
## kw_avg_min                               0.00                     0.00
## kw_min_max                               0.00                     0.00
## kw_max_max                               0.00                     0.00
## kw_avg_max                               0.00                     0.00
## kw_min_avg                               0.00                     0.00
## kw_max_avg                               0.00                     0.00
## kw_avg_avg                               0.00                    -0.01
## self_reference_min_shares                0.00                     0.00
## self_reference_max_shares                0.00                     0.00
## self_reference_avg_sharess               0.00                     0.00
## LDA_00                                   0.00                     0.00
## LDA_01                                   0.00                     0.00
## LDA_02                                   0.00                    -0.01
## LDA_03                                  -0.01                    -0.01
## LDA_04                                   0.00                     0.00
## global_subjectivity                     -0.01                     0.00
## global_sentiment_polarity                0.00                     0.00
## global_rate_positive_words               0.00                     0.00
## global_rate_negative_words               0.00                     0.00
## rate_positive_words                      0.00                     0.00
## rate_negative_words                      0.00                     0.00
## avg_positive_polarity                    0.00                     0.00
## min_positive_polarity                    0.00                     0.00
## max_positive_polarity                    0.00                    -0.01
## avg_negative_polarity                    0.00                     0.00
## min_negative_polarity                    0.00                     0.01
## max_negative_polarity                    0.00                     0.00
## title_subjectivity                      -0.01                    -0.01
## title_sentiment_polarity                 0.00                     0.00
## abs_title_subjectivity                  -0.01                    -0.01
## abs_title_sentiment_polarity             0.00                    -0.01
## shares                                   0.00                     0.00
##                              num_hrefs num_self_hrefs num_imgs num_videos
## n_tokens_title                   -0.05          -0.02    -0.01       0.05
## n_tokens_content                  0.42           0.30     0.34       0.10
## n_unique_tokens                   0.00           0.01     0.03       0.00
## n_non_stop_words                  0.00           0.01     0.03       0.00
## n_non_stop_unique_tokens          0.00           0.01     0.02       0.00
## num_hrefs                         1.00           0.39     0.35       0.11
## num_self_hrefs                    0.39           1.00     0.24       0.08
## num_imgs                          0.35           0.24     1.00      -0.07
## num_videos                        0.11           0.08    -0.07       1.00
## average_token_length              0.22           0.13     0.04       0.00
## num_keywords                      0.13           0.10     0.09      -0.02
## kw_min_min                       -0.05          -0.02    -0.04       0.01
## kw_max_min                        0.02           0.00     0.01       0.01
## kw_avg_min                        0.01           0.00     0.00      -0.01
## kw_min_max                       -0.02          -0.03     0.00       0.04
## kw_max_max                        0.06           0.00     0.05       0.00
## kw_avg_max                       -0.02          -0.07     0.00       0.14
## kw_min_avg                        0.06           0.04     0.09       0.07
## kw_max_avg                        0.07          -0.03     0.07       0.05
## kw_avg_avg                        0.12          -0.02     0.15       0.12
## self_reference_min_shares        -0.01          -0.03     0.01       0.00
## self_reference_max_shares         0.08           0.13     0.04       0.08
## self_reference_avg_sharess        0.03           0.02     0.02       0.04
## LDA_00                           -0.03          -0.02    -0.15      -0.07
## LDA_01                           -0.05           0.00     0.05       0.02
## LDA_02                           -0.01          -0.09    -0.10      -0.09
## LDA_03                            0.13          -0.03     0.21       0.27
## LDA_04                           -0.06           0.14    -0.02      -0.13
## global_subjectivity               0.20           0.11     0.08       0.08
## global_sentiment_polarity         0.09           0.09     0.03      -0.03
## global_rate_positive_words        0.06           0.12    -0.04       0.08
## global_rate_negative_words        0.03           0.01     0.02       0.19
## rate_positive_words               0.10           0.14    -0.02      -0.05
## rate_negative_words               0.06          -0.01     0.04       0.07
## avg_positive_polarity             0.18           0.10     0.10       0.10
## min_positive_polarity            -0.08          -0.08    -0.02      -0.01
## max_positive_polarity             0.28           0.19     0.16       0.13
## avg_negative_polarity            -0.15          -0.06    -0.07      -0.12
## min_negative_polarity            -0.26          -0.12    -0.13      -0.14
## max_negative_polarity             0.06           0.04     0.04       0.03
## title_subjectivity                0.04          -0.01     0.06       0.07
## title_sentiment_polarity          0.04           0.03     0.05       0.02
## abs_title_subjectivity            0.01           0.01    -0.01      -0.03
## abs_title_sentiment_polarity      0.06           0.00     0.07       0.06
## shares                            0.04           0.00     0.03       0.02
##                              average_token_length num_keywords kw_min_min
## n_tokens_title                              -0.08        -0.01      -0.11
## n_tokens_content                             0.16         0.08      -0.06
## n_unique_tokens                              0.02         0.00       0.00
## n_non_stop_words                             0.03         0.00       0.00
## n_non_stop_unique_tokens                     0.03         0.00       0.00
## num_hrefs                                    0.22         0.13      -0.05
## num_self_hrefs                               0.13         0.10      -0.02
## num_imgs                                     0.04         0.09      -0.04
## num_videos                                   0.00        -0.02       0.01
## average_token_length                         1.00        -0.02       0.03
## num_keywords                                -0.02         1.00       0.00
## kw_min_min                                   0.03         0.00       1.00
## kw_max_min                                  -0.01         0.08       0.01
## kw_avg_min                                   0.01         0.08       0.12
## kw_min_max                                  -0.04        -0.28      -0.06
## kw_max_max                                  -0.04         0.01      -0.85
## kw_avg_max                                  -0.16        -0.33      -0.50
## kw_min_avg                                  -0.08        -0.27      -0.14
## kw_max_avg                                  -0.05         0.12      -0.08
## kw_avg_avg                                  -0.14         0.02      -0.19
## self_reference_min_shares                    0.04         0.00      -0.03
## self_reference_max_shares                    0.04         0.01      -0.04
## self_reference_avg_sharess                   0.04         0.01      -0.04
## LDA_00                                       0.08        -0.16       0.03
## LDA_01                                      -0.02        -0.06       0.01
## LDA_02                                       0.10        -0.01      -0.07
## LDA_03                                      -0.18         0.02      -0.01
## LDA_04                                       0.03         0.18       0.05
## global_subjectivity                          0.59         0.04       0.04
## global_sentiment_polarity                    0.18         0.08       0.08
## global_rate_positive_words                   0.32         0.05       0.10
## global_rate_negative_words                   0.23        -0.04      -0.01
## rate_positive_words                          0.57         0.03       0.09
## rate_negative_words                          0.32        -0.07      -0.05
## avg_positive_polarity                        0.54         0.04       0.04
## min_positive_polarity                        0.22        -0.01       0.02
## max_positive_polarity                        0.47         0.07       0.02
## avg_negative_polarity                       -0.32         0.02       0.02
## min_negative_polarity                       -0.28        -0.01       0.06
## max_negative_polarity                       -0.19         0.03      -0.03
## title_subjectivity                          -0.03         0.01       0.00
## title_sentiment_polarity                    -0.01         0.03       0.03
## abs_title_subjectivity                       0.02         0.00       0.01
## abs_title_sentiment_polarity                -0.03         0.02       0.01
## shares                                      -0.03         0.02       0.00
##                              kw_max_min kw_avg_min kw_min_max kw_max_max
## n_tokens_title                     0.00      -0.03       0.01       0.12
## n_tokens_content                   0.00       0.00      -0.03       0.06
## n_unique_tokens                    0.00       0.00       0.00       0.00
## n_non_stop_words                   0.00       0.00       0.00       0.00
## n_non_stop_unique_tokens           0.00       0.00       0.00       0.00
## num_hrefs                          0.02       0.01      -0.02       0.06
## num_self_hrefs                     0.00       0.00      -0.03       0.00
## num_imgs                           0.01       0.00       0.00       0.05
## num_videos                         0.01      -0.01       0.04       0.00
## average_token_length              -0.01       0.01      -0.04      -0.04
## num_keywords                       0.08       0.08      -0.28       0.01
## kw_min_min                         0.01       0.12      -0.06      -0.85
## kw_max_min                         1.00       0.94      -0.04       0.00
## kw_avg_min                         0.94       1.00      -0.07      -0.11
## kw_min_max                        -0.04      -0.07       1.00       0.07
## kw_max_max                         0.00      -0.11       0.07       1.00
## kw_avg_max                        -0.04      -0.14       0.42       0.56
## kw_min_avg                         0.02      -0.02       0.35       0.16
## kw_max_avg                         0.53       0.49       0.04       0.10
## kw_avg_avg                         0.38       0.34       0.18       0.23
## self_reference_min_shares          0.02       0.01       0.01       0.03
## self_reference_max_shares          0.04       0.03       0.03       0.05
## self_reference_avg_sharess         0.03       0.02       0.02       0.05
## LDA_00                             0.00       0.03       0.05      -0.03
## LDA_01                            -0.01      -0.01      -0.02      -0.01
## LDA_02                            -0.02      -0.04      -0.06       0.07
## LDA_03                             0.03       0.01       0.08       0.02
## LDA_04                             0.00       0.01      -0.06      -0.05
## global_subjectivity                0.01       0.02      -0.02      -0.04
## global_sentiment_polarity          0.01       0.03      -0.01      -0.09
## global_rate_positive_words         0.00       0.03      -0.01      -0.11
## global_rate_negative_words        -0.01      -0.01       0.00       0.02
## rate_positive_words                0.00       0.03      -0.03      -0.10
## rate_negative_words               -0.01      -0.02      -0.01       0.06
## avg_positive_polarity              0.01       0.02      -0.01      -0.04
## min_positive_polarity              0.01       0.01       0.00      -0.02
## max_positive_polarity              0.01       0.02      -0.02      -0.02
## avg_negative_polarity              0.00       0.00       0.00      -0.02
## min_negative_polarity              0.00       0.01       0.01      -0.06
## max_negative_polarity              0.01       0.00      -0.01       0.03
## title_subjectivity                 0.01       0.00       0.03       0.00
## title_sentiment_polarity           0.01       0.01       0.01      -0.03
## abs_title_subjectivity             0.00       0.01      -0.02      -0.01
## abs_title_sentiment_polarity       0.01       0.01       0.03      -0.01
## shares                             0.04       0.04       0.01       0.00
##                              kw_avg_max kw_min_avg kw_max_avg kw_avg_avg
## n_tokens_title                     0.11      -0.01       0.01       0.00
## n_tokens_content                  -0.10      -0.02      -0.03      -0.08
## n_unique_tokens                    0.00       0.01       0.00       0.00
## n_non_stop_words                   0.00       0.00       0.00       0.00
## n_non_stop_unique_tokens           0.00       0.00       0.00      -0.01
## num_hrefs                         -0.02       0.06       0.07       0.12
## num_self_hrefs                    -0.07       0.04      -0.03      -0.02
## num_imgs                           0.00       0.09       0.07       0.15
## num_videos                         0.14       0.07       0.05       0.12
## average_token_length              -0.16      -0.08      -0.05      -0.14
## num_keywords                      -0.33      -0.27       0.12       0.02
## kw_min_min                        -0.50      -0.14      -0.08      -0.19
## kw_max_min                        -0.04       0.02       0.53       0.38
## kw_avg_min                        -0.14      -0.02       0.49       0.34
## kw_min_max                         0.42       0.35       0.04       0.18
## kw_max_max                         0.56       0.16       0.10       0.23
## kw_avg_max                         1.00       0.41       0.15       0.44
## kw_min_avg                         0.41       1.00       0.10       0.47
## kw_max_avg                         0.15       0.10       1.00       0.81
## kw_avg_avg                         0.44       0.47       0.81       1.00
## self_reference_min_shares          0.05       0.05       0.10       0.12
## self_reference_max_shares          0.09       0.06       0.21       0.20
## self_reference_avg_sharess         0.09       0.06       0.19       0.19
## LDA_00                             0.12       0.00      -0.02      -0.04
## LDA_01                            -0.08      -0.01       0.00       0.01
## LDA_02                            -0.12      -0.13      -0.11      -0.28
## LDA_03                             0.30       0.19       0.22       0.45
## LDA_04                            -0.23      -0.05      -0.09      -0.16
## global_subjectivity               -0.06       0.02       0.04       0.07
## global_sentiment_polarity         -0.08       0.00       0.01       0.02
## global_rate_positive_words        -0.11       0.00      -0.01      -0.01
## global_rate_negative_words         0.01       0.02       0.01       0.03
## rate_positive_words               -0.16      -0.06      -0.04      -0.10
## rate_negative_words                0.00      -0.01      -0.01      -0.03
## avg_positive_polarity             -0.07       0.01       0.03       0.05
## min_positive_polarity              0.01       0.02       0.03       0.06
## max_positive_polarity             -0.10      -0.01       0.01       0.00
## avg_negative_polarity             -0.02      -0.02      -0.04      -0.07
## min_negative_polarity              0.01      -0.01      -0.02      -0.02
## max_negative_polarity             -0.02      -0.01      -0.02      -0.04
## title_subjectivity                 0.04       0.05       0.05       0.09
## title_sentiment_polarity           0.00       0.03       0.02       0.04
## abs_title_subjectivity            -0.02      -0.02      -0.01      -0.02
## abs_title_sentiment_polarity       0.03       0.06       0.05       0.09
## shares                             0.04       0.03       0.07       0.11
##                              self_reference_min_shares
## n_tokens_title                                    0.00
## n_tokens_content                                 -0.03
## n_unique_tokens                                   0.00
## n_non_stop_words                                  0.00
## n_non_stop_unique_tokens                          0.00
## num_hrefs                                        -0.01
## num_self_hrefs                                   -0.03
## num_imgs                                          0.01
## num_videos                                        0.00
## average_token_length                              0.04
## num_keywords                                      0.00
## kw_min_min                                       -0.03
## kw_max_min                                        0.02
## kw_avg_min                                        0.01
## kw_min_max                                        0.01
## kw_max_max                                        0.03
## kw_avg_max                                        0.05
## kw_min_avg                                        0.05
## kw_max_avg                                        0.10
## kw_avg_avg                                        0.12
## self_reference_min_shares                         1.00
## self_reference_max_shares                         0.45
## self_reference_avg_sharess                        0.79
## LDA_00                                            0.00
## LDA_01                                           -0.01
## LDA_02                                           -0.05
## LDA_03                                            0.04
## LDA_04                                            0.01
## global_subjectivity                               0.06
## global_sentiment_polarity                         0.01
## global_rate_positive_words                        0.01
## global_rate_negative_words                        0.02
## rate_positive_words                               0.02
## rate_negative_words                               0.02
## avg_positive_polarity                             0.05
## min_positive_polarity                             0.04
## max_positive_polarity                             0.01
## avg_negative_polarity                            -0.04
## min_negative_polarity                            -0.01
## max_negative_polarity                            -0.05
## title_subjectivity                                0.01
## title_sentiment_polarity                          0.01
## abs_title_subjectivity                            0.00
## abs_title_sentiment_polarity                      0.01
## shares                                            0.07
##                              self_reference_max_shares
## n_tokens_title                                    0.00
## n_tokens_content                                  0.03
## n_unique_tokens                                   0.00
## n_non_stop_words                                  0.00
## n_non_stop_unique_tokens                          0.00
## num_hrefs                                         0.08
## num_self_hrefs                                    0.13
## num_imgs                                          0.04
## num_videos                                        0.08
## average_token_length                              0.04
## num_keywords                                      0.01
## kw_min_min                                       -0.04
## kw_max_min                                        0.04
## kw_avg_min                                        0.03
## kw_min_max                                        0.03
## kw_max_max                                        0.05
## kw_avg_max                                        0.09
## kw_min_avg                                        0.06
## kw_max_avg                                        0.21
## kw_avg_avg                                        0.20
## self_reference_min_shares                         0.45
## self_reference_max_shares                         1.00
## self_reference_avg_sharess                        0.86
## LDA_00                                            0.01
## LDA_01                                           -0.02
## LDA_02                                           -0.05
## LDA_03                                            0.05
## LDA_04                                            0.00
## global_subjectivity                               0.07
## global_sentiment_polarity                         0.01
## global_rate_positive_words                        0.02
## global_rate_negative_words                        0.02
## rate_positive_words                               0.02
## rate_negative_words                               0.02
## avg_positive_polarity                             0.05
## min_positive_polarity                             0.00
## max_positive_polarity                             0.04
## avg_negative_polarity                            -0.06
## min_negative_polarity                            -0.05
## max_negative_polarity                            -0.02
## title_subjectivity                                0.02
## title_sentiment_polarity                          0.00
## abs_title_subjectivity                            0.00
## abs_title_sentiment_polarity                      0.01
## shares                                            0.06
##                              self_reference_avg_sharess LDA_00 LDA_01 LDA_02
## n_tokens_title                                     0.00  -0.07   0.06   0.04
## n_tokens_content                                  -0.01   0.02  -0.01   0.08
## n_unique_tokens                                    0.00   0.00   0.00  -0.01
## n_non_stop_words                                   0.00   0.00   0.00   0.00
## n_non_stop_unique_tokens                           0.00   0.00   0.00  -0.01
## num_hrefs                                          0.03  -0.03  -0.05  -0.01
## num_self_hrefs                                     0.02  -0.02   0.00  -0.09
## num_imgs                                           0.02  -0.15   0.05  -0.10
## num_videos                                         0.04  -0.07   0.02  -0.09
## average_token_length                               0.04   0.08  -0.02   0.10
## num_keywords                                       0.01  -0.16  -0.06  -0.01
## kw_min_min                                        -0.04   0.03   0.01  -0.07
## kw_max_min                                         0.03   0.00  -0.01  -0.02
## kw_avg_min                                         0.02   0.03  -0.01  -0.04
## kw_min_max                                         0.02   0.05  -0.02  -0.06
## kw_max_max                                         0.05  -0.03  -0.01   0.07
## kw_avg_max                                         0.09   0.12  -0.08  -0.12
## kw_min_avg                                         0.06   0.00  -0.01  -0.13
## kw_max_avg                                         0.19  -0.02   0.00  -0.11
## kw_avg_avg                                         0.19  -0.04   0.01  -0.28
## self_reference_min_shares                          0.79   0.00  -0.01  -0.05
## self_reference_max_shares                          0.86   0.01  -0.02  -0.05
## self_reference_avg_sharess                         1.00   0.00  -0.01  -0.05
## LDA_00                                             0.00   1.00  -0.18  -0.26
## LDA_01                                            -0.01  -0.18   1.00  -0.23
## LDA_02                                            -0.05  -0.26  -0.23   1.00
## LDA_03                                             0.05  -0.27  -0.12  -0.33
## LDA_04                                             0.00  -0.24  -0.25  -0.23
## global_subjectivity                                0.08  -0.01   0.05  -0.17
## global_sentiment_polarity                          0.01   0.11  -0.01  -0.22
## global_rate_positive_words                         0.01   0.15   0.06  -0.23
## global_rate_negative_words                         0.03  -0.08   0.09   0.02
## rate_positive_words                                0.02   0.16  -0.01  -0.14
## rate_negative_words                                0.03  -0.12   0.03   0.19
## avg_positive_polarity                              0.05   0.00   0.05  -0.14
## min_positive_polarity                              0.03  -0.11   0.02  -0.06
## max_positive_polarity                              0.03   0.05   0.04  -0.11
## avg_negative_polarity                             -0.06   0.05  -0.10   0.03
## min_negative_polarity                             -0.03   0.04  -0.06  -0.07
## max_negative_polarity                             -0.04  -0.01  -0.04   0.08
## title_subjectivity                                 0.01  -0.05   0.06  -0.07
## title_sentiment_polarity                           0.00   0.03  -0.01  -0.07
## abs_title_subjectivity                             0.00   0.00  -0.07   0.05
## abs_title_sentiment_polarity                       0.01  -0.03   0.04  -0.08
## shares                                             0.07   0.00  -0.01  -0.06
##                              LDA_03 LDA_04 global_subjectivity
## n_tokens_title                 0.04  -0.06               -0.06
## n_tokens_content              -0.14   0.04                0.13
## n_unique_tokens                0.00  -0.01               -0.01
## n_non_stop_words              -0.01   0.00               -0.01
## n_non_stop_unique_tokens      -0.01   0.00                0.00
## num_hrefs                      0.13  -0.06                0.20
## num_self_hrefs                -0.03   0.14                0.11
## num_imgs                       0.21  -0.02                0.08
## num_videos                     0.27  -0.13                0.08
## average_token_length          -0.18   0.03                0.59
## num_keywords                   0.02   0.18                0.04
## kw_min_min                    -0.01   0.05                0.04
## kw_max_min                     0.03   0.00                0.01
## kw_avg_min                     0.01   0.01                0.02
## kw_min_max                     0.08  -0.06               -0.02
## kw_max_max                     0.02  -0.05               -0.04
## kw_avg_max                     0.30  -0.23               -0.06
## kw_min_avg                     0.19  -0.05                0.02
## kw_max_avg                     0.22  -0.09                0.04
## kw_avg_avg                     0.45  -0.16                0.07
## self_reference_min_shares      0.04   0.01                0.06
## self_reference_max_shares      0.05   0.00                0.07
## self_reference_avg_sharess     0.05   0.00                0.08
## LDA_00                        -0.27  -0.24               -0.01
## LDA_01                        -0.12  -0.25                0.05
## LDA_02                        -0.33  -0.23               -0.17
## LDA_03                         1.00  -0.36                0.08
## LDA_04                        -0.36   1.00                0.05
## global_subjectivity            0.08   0.05                1.00
## global_sentiment_polarity     -0.01   0.13                0.34
## global_rate_positive_words    -0.05   0.09                0.46
## global_rate_negative_words     0.07  -0.10                0.25
## rate_positive_words           -0.17   0.17                0.48
## rate_negative_words            0.01  -0.11                0.13
## avg_positive_polarity          0.08   0.02                0.63
## min_positive_polarity          0.11   0.02                0.25
## max_positive_polarity         -0.01   0.04                0.51
## avg_negative_polarity         -0.12   0.12               -0.45
## min_negative_polarity         -0.03   0.10               -0.35
## max_negative_polarity         -0.08   0.04               -0.22
## title_subjectivity             0.12  -0.05                0.12
## title_sentiment_polarity       0.03   0.02                0.05
## abs_title_subjectivity        -0.01   0.01               -0.01
## abs_title_sentiment_polarity   0.11  -0.04                0.09
## shares                         0.08  -0.01                0.03
##                              global_sentiment_polarity
## n_tokens_title                                   -0.08
## n_tokens_content                                  0.02
## n_unique_tokens                                   0.00
## n_non_stop_words                                  0.00
## n_non_stop_unique_tokens                          0.00
## num_hrefs                                         0.09
## num_self_hrefs                                    0.09
## num_imgs                                          0.03
## num_videos                                       -0.03
## average_token_length                              0.18
## num_keywords                                      0.08
## kw_min_min                                        0.08
## kw_max_min                                        0.01
## kw_avg_min                                        0.03
## kw_min_max                                       -0.01
## kw_max_max                                       -0.09
## kw_avg_max                                       -0.08
## kw_min_avg                                        0.00
## kw_max_avg                                        0.01
## kw_avg_avg                                        0.02
## self_reference_min_shares                         0.01
## self_reference_max_shares                         0.01
## self_reference_avg_sharess                        0.01
## LDA_00                                            0.11
## LDA_01                                           -0.01
## LDA_02                                           -0.22
## LDA_03                                           -0.01
## LDA_04                                            0.13
## global_subjectivity                               0.34
## global_sentiment_polarity                         1.00
## global_rate_positive_words                        0.57
## global_rate_negative_words                       -0.47
## rate_positive_words                               0.73
## rate_negative_words                              -0.65
## avg_positive_polarity                             0.50
## min_positive_polarity                             0.09
## max_positive_polarity                             0.43
## avg_negative_polarity                             0.23
## min_negative_polarity                             0.28
## max_negative_polarity                            -0.05
## title_subjectivity                                0.02
## title_sentiment_polarity                          0.24
## abs_title_subjectivity                           -0.02
## abs_title_sentiment_polarity                      0.07
## shares                                            0.00
##                              global_rate_positive_words
## n_tokens_title                                    -0.07
## n_tokens_content                                   0.13
## n_unique_tokens                                    0.00
## n_non_stop_words                                   0.00
## n_non_stop_unique_tokens                           0.00
## num_hrefs                                          0.06
## num_self_hrefs                                     0.12
## num_imgs                                          -0.04
## num_videos                                         0.08
## average_token_length                               0.32
## num_keywords                                       0.05
## kw_min_min                                         0.10
## kw_max_min                                         0.00
## kw_avg_min                                         0.03
## kw_min_max                                        -0.01
## kw_max_max                                        -0.11
## kw_avg_max                                        -0.11
## kw_min_avg                                         0.00
## kw_max_avg                                        -0.01
## kw_avg_avg                                        -0.01
## self_reference_min_shares                          0.01
## self_reference_max_shares                          0.02
## self_reference_avg_sharess                         0.01
## LDA_00                                             0.15
## LDA_01                                             0.06
## LDA_02                                            -0.23
## LDA_03                                            -0.05
## LDA_04                                             0.09
## global_subjectivity                                0.46
## global_sentiment_polarity                          0.57
## global_rate_positive_words                         1.00
## global_rate_negative_words                         0.11
## rate_positive_words                                0.63
## rate_negative_words                               -0.33
## avg_positive_polarity                              0.33
## min_positive_polarity                             -0.11
## max_positive_polarity                              0.48
## avg_negative_polarity                             -0.14
## min_negative_polarity                             -0.12
## max_negative_polarity                             -0.08
## title_subjectivity                                 0.11
## title_sentiment_polarity                           0.14
## abs_title_subjectivity                            -0.13
## abs_title_sentiment_polarity                       0.10
## shares                                             0.00
##                              global_rate_negative_words rate_positive_words
## n_tokens_title                                     0.02               -0.07
## n_tokens_content                                   0.12                0.10
## n_unique_tokens                                    0.00               -0.01
## n_non_stop_words                                   0.00                0.00
## n_non_stop_unique_tokens                           0.00                0.00
## num_hrefs                                          0.03                0.10
## num_self_hrefs                                     0.01                0.14
## num_imgs                                           0.02               -0.02
## num_videos                                         0.19               -0.05
## average_token_length                               0.23                0.57
## num_keywords                                      -0.04                0.03
## kw_min_min                                        -0.01                0.09
## kw_max_min                                        -0.01                0.00
## kw_avg_min                                        -0.01                0.03
## kw_min_max                                         0.00               -0.03
## kw_max_max                                         0.02               -0.10
## kw_avg_max                                         0.01               -0.16
## kw_min_avg                                         0.02               -0.06
## kw_max_avg                                         0.01               -0.04
## kw_avg_avg                                         0.03               -0.10
## self_reference_min_shares                          0.02                0.02
## self_reference_max_shares                          0.02                0.02
## self_reference_avg_sharess                         0.03                0.02
## LDA_00                                            -0.08                0.16
## LDA_01                                             0.09               -0.01
## LDA_02                                             0.02               -0.14
## LDA_03                                             0.07               -0.17
## LDA_04                                            -0.10                0.17
## global_subjectivity                                0.25                0.48
## global_sentiment_polarity                         -0.47                0.73
## global_rate_positive_words                         0.11                0.63
## global_rate_negative_words                         1.00               -0.40
## rate_positive_words                               -0.40                1.00
## rate_negative_words                                0.78               -0.54
## avg_positive_polarity                              0.19                0.41
## min_positive_polarity                              0.06                0.05
## max_positive_polarity                              0.18                0.45
## avg_negative_polarity                             -0.34               -0.04
## min_negative_polarity                             -0.47                0.09
## max_negative_polarity                              0.07               -0.20
## title_subjectivity                                 0.09               -0.02
## title_sentiment_polarity                          -0.13                0.15
## abs_title_subjectivity                            -0.06               -0.02
## abs_title_sentiment_polarity                       0.06                0.00
## shares                                             0.00               -0.01
##                              rate_negative_words avg_positive_polarity
## n_tokens_title                              0.04                 -0.05
## n_tokens_content                            0.10                  0.13
## n_unique_tokens                             0.00                 -0.01
## n_non_stop_words                            0.00                  0.00
## n_non_stop_unique_tokens                    0.00                  0.00
## num_hrefs                                   0.06                  0.18
## num_self_hrefs                             -0.01                  0.10
## num_imgs                                    0.04                  0.10
## num_videos                                  0.07                  0.10
## average_token_length                        0.32                  0.54
## num_keywords                               -0.07                  0.04
## kw_min_min                                 -0.05                  0.04
## kw_max_min                                 -0.01                  0.01
## kw_avg_min                                 -0.02                  0.02
## kw_min_max                                 -0.01                 -0.01
## kw_max_max                                  0.06                 -0.04
## kw_avg_max                                  0.00                 -0.07
## kw_min_avg                                 -0.01                  0.01
## kw_max_avg                                 -0.01                  0.03
## kw_avg_avg                                 -0.03                  0.05
## self_reference_min_shares                   0.02                  0.05
## self_reference_max_shares                   0.02                  0.05
## self_reference_avg_sharess                  0.03                  0.05
## LDA_00                                     -0.12                  0.00
## LDA_01                                      0.03                  0.05
## LDA_02                                      0.19                 -0.14
## LDA_03                                      0.01                  0.08
## LDA_04                                     -0.11                  0.02
## global_subjectivity                         0.13                  0.63
## global_sentiment_polarity                  -0.65                  0.50
## global_rate_positive_words                 -0.33                  0.33
## global_rate_negative_words                  0.78                  0.19
## rate_positive_words                        -0.54                  0.41
## rate_negative_words                         1.00                  0.14
## avg_positive_polarity                       0.14                  1.00
## min_positive_polarity                       0.19                  0.46
## max_positive_polarity                       0.03                  0.70
## avg_negative_polarity                      -0.34                 -0.27
## min_negative_polarity                      -0.45                 -0.23
## max_negative_polarity                       0.03                 -0.14
## title_subjectivity                          0.00                  0.06
## title_sentiment_polarity                   -0.18                  0.09
## abs_title_subjectivity                      0.03                  0.01
## abs_title_sentiment_polarity               -0.02                  0.10
## shares                                     -0.01                  0.01
##                              min_positive_polarity max_positive_polarity
## n_tokens_title                               -0.02                 -0.02
## n_tokens_content                             -0.26                  0.41
## n_unique_tokens                               0.01                 -0.01
## n_non_stop_words                              0.00                  0.00
## n_non_stop_unique_tokens                      0.00                 -0.01
## num_hrefs                                    -0.08                  0.28
## num_self_hrefs                               -0.08                  0.19
## num_imgs                                     -0.02                  0.16
## num_videos                                   -0.01                  0.13
## average_token_length                          0.22                  0.47
## num_keywords                                 -0.01                  0.07
## kw_min_min                                    0.02                  0.02
## kw_max_min                                    0.01                  0.01
## kw_avg_min                                    0.01                  0.02
## kw_min_max                                    0.00                 -0.02
## kw_max_max                                   -0.02                 -0.02
## kw_avg_max                                    0.01                 -0.10
## kw_min_avg                                    0.02                 -0.01
## kw_max_avg                                    0.03                  0.01
## kw_avg_avg                                    0.06                  0.00
## self_reference_min_shares                     0.04                  0.01
## self_reference_max_shares                     0.00                  0.04
## self_reference_avg_sharess                    0.03                  0.03
## LDA_00                                       -0.11                  0.05
## LDA_01                                        0.02                  0.04
## LDA_02                                       -0.06                 -0.11
## LDA_03                                        0.11                 -0.01
## LDA_04                                        0.02                  0.04
## global_subjectivity                           0.25                  0.51
## global_sentiment_polarity                     0.09                  0.43
## global_rate_positive_words                   -0.11                  0.48
## global_rate_negative_words                    0.06                  0.18
## rate_positive_words                           0.05                  0.45
## rate_negative_words                           0.19                  0.03
## avg_positive_polarity                         0.46                  0.70
## min_positive_polarity                         1.00                  0.01
## max_positive_polarity                         0.01                  1.00
## avg_negative_polarity                        -0.07                 -0.27
## min_negative_polarity                         0.09                 -0.36
## max_negative_polarity                        -0.19                  0.00
## title_subjectivity                            0.01                  0.06
## title_sentiment_polarity                      0.00                  0.08
## abs_title_subjectivity                        0.02                 -0.02
## abs_title_sentiment_polarity                  0.02                  0.08
## shares                                        0.00                  0.01
##                              avg_negative_polarity min_negative_polarity
## n_tokens_title                               -0.02                 -0.03
## n_tokens_content                             -0.13                 -0.45
## n_unique_tokens                               0.00                  0.01
## n_non_stop_words                              0.00                  0.00
## n_non_stop_unique_tokens                      0.00                  0.01
## num_hrefs                                    -0.15                 -0.26
## num_self_hrefs                               -0.06                 -0.12
## num_imgs                                     -0.07                 -0.13
## num_videos                                   -0.12                 -0.14
## average_token_length                         -0.32                 -0.28
## num_keywords                                  0.02                 -0.01
## kw_min_min                                    0.02                  0.06
## kw_max_min                                    0.00                  0.00
## kw_avg_min                                    0.00                  0.01
## kw_min_max                                    0.00                  0.01
## kw_max_max                                   -0.02                 -0.06
## kw_avg_max                                   -0.02                  0.01
## kw_min_avg                                   -0.02                 -0.01
## kw_max_avg                                   -0.04                 -0.02
## kw_avg_avg                                   -0.07                 -0.02
## self_reference_min_shares                    -0.04                 -0.01
## self_reference_max_shares                    -0.06                 -0.05
## self_reference_avg_sharess                   -0.06                 -0.03
## LDA_00                                        0.05                  0.04
## LDA_01                                       -0.10                 -0.06
## LDA_02                                        0.03                 -0.07
## LDA_03                                       -0.12                 -0.03
## LDA_04                                        0.12                  0.10
## global_subjectivity                          -0.45                 -0.35
## global_sentiment_polarity                     0.23                  0.28
## global_rate_positive_words                   -0.14                 -0.12
## global_rate_negative_words                   -0.34                 -0.47
## rate_positive_words                          -0.04                  0.09
## rate_negative_words                          -0.34                 -0.45
## avg_positive_polarity                        -0.27                 -0.23
## min_positive_polarity                        -0.07                  0.09
## max_positive_polarity                        -0.27                 -0.36
## avg_negative_polarity                         1.00                  0.75
## min_negative_polarity                         0.75                  1.00
## max_negative_polarity                         0.58                  0.08
## title_subjectivity                           -0.08                 -0.06
## title_sentiment_polarity                      0.08                  0.08
## abs_title_subjectivity                       -0.01                 -0.01
## abs_title_sentiment_polarity                 -0.08                 -0.06
## shares                                       -0.03                 -0.02
##                              max_negative_polarity title_subjectivity
## n_tokens_title                                0.01               0.08
## n_tokens_content                              0.23               0.00
## n_unique_tokens                               0.00              -0.01
## n_non_stop_words                              0.00              -0.01
## n_non_stop_unique_tokens                      0.00              -0.01
## num_hrefs                                     0.06               0.04
## num_self_hrefs                                0.04              -0.01
## num_imgs                                      0.04               0.06
## num_videos                                    0.03               0.07
## average_token_length                         -0.19              -0.03
## num_keywords                                  0.03               0.01
## kw_min_min                                   -0.03               0.00
## kw_max_min                                    0.01               0.01
## kw_avg_min                                    0.00               0.00
## kw_min_max                                   -0.01               0.03
## kw_max_max                                    0.03               0.00
## kw_avg_max                                   -0.02               0.04
## kw_min_avg                                   -0.01               0.05
## kw_max_avg                                   -0.02               0.05
## kw_avg_avg                                   -0.04               0.09
## self_reference_min_shares                    -0.05               0.01
## self_reference_max_shares                    -0.02               0.02
## self_reference_avg_sharess                   -0.04               0.01
## LDA_00                                       -0.01              -0.05
## LDA_01                                       -0.04               0.06
## LDA_02                                        0.08              -0.07
## LDA_03                                       -0.08               0.12
## LDA_04                                        0.04              -0.05
## global_subjectivity                          -0.22               0.12
## global_sentiment_polarity                    -0.05               0.02
## global_rate_positive_words                   -0.08               0.11
## global_rate_negative_words                    0.07               0.09
## rate_positive_words                          -0.20              -0.02
## rate_negative_words                           0.03               0.00
## avg_positive_polarity                        -0.14               0.06
## min_positive_polarity                        -0.19               0.01
## max_positive_polarity                         0.00               0.06
## avg_negative_polarity                         0.58              -0.08
## min_negative_polarity                         0.08              -0.06
## max_negative_polarity                         1.00              -0.02
## title_subjectivity                           -0.02               1.00
## title_sentiment_polarity                      0.00               0.24
## abs_title_subjectivity                        0.00              -0.49
## abs_title_sentiment_polarity                 -0.02               0.71
## shares                                       -0.02               0.01
##                              title_sentiment_polarity abs_title_subjectivity
## n_tokens_title                                   0.00                  -0.15
## n_tokens_content                                 0.02                   0.01
## n_unique_tokens                                  0.00                  -0.01
## n_non_stop_words                                 0.00                  -0.01
## n_non_stop_unique_tokens                         0.00                  -0.01
## num_hrefs                                        0.04                   0.01
## num_self_hrefs                                   0.03                   0.01
## num_imgs                                         0.05                  -0.01
## num_videos                                       0.02                  -0.03
## average_token_length                            -0.01                   0.02
## num_keywords                                     0.03                   0.00
## kw_min_min                                       0.03                   0.01
## kw_max_min                                       0.01                   0.00
## kw_avg_min                                       0.01                   0.01
## kw_min_max                                       0.01                  -0.02
## kw_max_max                                      -0.03                  -0.01
## kw_avg_max                                       0.00                  -0.02
## kw_min_avg                                       0.03                  -0.02
## kw_max_avg                                       0.02                  -0.01
## kw_avg_avg                                       0.04                  -0.02
## self_reference_min_shares                        0.01                   0.00
## self_reference_max_shares                        0.00                   0.00
## self_reference_avg_sharess                       0.00                   0.00
## LDA_00                                           0.03                   0.00
## LDA_01                                          -0.01                  -0.07
## LDA_02                                          -0.07                   0.05
## LDA_03                                           0.03                  -0.01
## LDA_04                                           0.02                   0.01
## global_subjectivity                              0.05                  -0.01
## global_sentiment_polarity                        0.24                  -0.02
## global_rate_positive_words                       0.14                  -0.13
## global_rate_negative_words                      -0.13                  -0.06
## rate_positive_words                              0.15                  -0.02
## rate_negative_words                             -0.18                   0.03
## avg_positive_polarity                            0.09                   0.01
## min_positive_polarity                            0.00                   0.02
## max_positive_polarity                            0.08                  -0.02
## avg_negative_polarity                            0.08                  -0.01
## min_negative_polarity                            0.08                  -0.01
## max_negative_polarity                            0.00                   0.00
## title_subjectivity                               0.24                  -0.49
## title_sentiment_polarity                         1.00                  -0.24
## abs_title_subjectivity                          -0.24                   1.00
## abs_title_sentiment_polarity                     0.42                  -0.40
## shares                                           0.01                   0.01
##                              abs_title_sentiment_polarity shares
## n_tokens_title                                       0.04   0.01
## n_tokens_content                                     0.02   0.01
## n_unique_tokens                                      0.00   0.00
## n_non_stop_words                                     0.00   0.00
## n_non_stop_unique_tokens                            -0.01   0.00
## num_hrefs                                            0.06   0.04
## num_self_hrefs                                       0.00   0.00
## num_imgs                                             0.07   0.03
## num_videos                                           0.06   0.02
## average_token_length                                -0.03  -0.03
## num_keywords                                         0.02   0.02
## kw_min_min                                           0.01   0.00
## kw_max_min                                           0.01   0.04
## kw_avg_min                                           0.01   0.04
## kw_min_max                                           0.03   0.01
## kw_max_max                                          -0.01   0.00
## kw_avg_max                                           0.03   0.04
## kw_min_avg                                           0.06   0.03
## kw_max_avg                                           0.05   0.07
## kw_avg_avg                                           0.09   0.11
## self_reference_min_shares                            0.01   0.07
## self_reference_max_shares                            0.01   0.06
## self_reference_avg_sharess                           0.01   0.07
## LDA_00                                              -0.03   0.00
## LDA_01                                               0.04  -0.01
## LDA_02                                              -0.08  -0.06
## LDA_03                                               0.11   0.08
## LDA_04                                              -0.04  -0.01
## global_subjectivity                                  0.09   0.03
## global_sentiment_polarity                            0.07   0.00
## global_rate_positive_words                           0.10   0.00
## global_rate_negative_words                           0.06   0.00
## rate_positive_words                                  0.00  -0.01
## rate_negative_words                                 -0.02  -0.01
## avg_positive_polarity                                0.10   0.01
## min_positive_polarity                                0.02   0.00
## max_positive_polarity                                0.08   0.01
## avg_negative_polarity                               -0.08  -0.03
## min_negative_polarity                               -0.06  -0.02
## max_negative_polarity                               -0.02  -0.02
## title_subjectivity                                   0.71   0.01
## title_sentiment_polarity                             0.42   0.01
## abs_title_subjectivity                              -0.40   0.01
## abs_title_sentiment_polarity                         1.00   0.02
## shares                                               0.02   1.00

It is a long matrix and not easy to read. We could visualise it with the corrplot function, but instead we will use findCorrelation(x = correlation_matrix, cutoff = 0.9) from the caret package, which identifies pairs of predictors whose correlation exceeds the chosen threshold and indicates which variables should be removed. By default it returns the column numbers of the variables to be deleted. Let's lower the cutoff to 0.75. For a pair of predictors correlated above 0.75 the function flags only one of them: it compares each member of the pair with all the other predictors and suggests removing the one whose average absolute correlation is higher.

findCorrelation(shares_correlation,
                cutoff = 0.75,
                names = TRUE) -> shares_crr 
shares_crr 
## [1] "kw_avg_avg"                "rate_negative_words"      
## [3] "kw_min_min"                "self_reference_max_shares"
## [5] "kw_max_min"                "self_reference_min_shares"
## [7] "n_non_stop_words"          "n_non_stop_unique_tokens"
The output above lists the names of the variables that are candidates for removal.
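To see how this tie-breaking rule works, here is a minimal sketch on a tiny synthetic correlation matrix (the names x1-x3 are made up for illustration): x1 and x2 are correlated above the cutoff, and x1 is the one flagged because its average absolute correlation with the remaining predictors is higher.

toy_cor <- matrix(c(1.00, 0.95, 0.40,
                    0.95, 1.00, 0.10,
                    0.40, 0.10, 1.00),
                  nrow = 3,
                  dimnames = list(c("x1", "x2", "x3"),
                                  c("x1", "x2", "x3")))
# should return "x1": mean |correlation| of x1 (0.675) exceeds that of x2 (0.525)
findCorrelation(toy_cor, cutoff = 0.75, names = TRUE)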

Correlation with the Outcome Variable (Shares)

Let's sort the predictors by their correlation with the outcome variable in decreasing order, and then use that ordering for the rows and columns of the correlation matrix on the plot.

correlation_order <-
  # we take correlations with number of shares
  shares_correlation[,"shares"] %>%
  # sort them in the decreasing order
  sort(decreasing = TRUE) %>%
  # and extract just the variables' names
  names()

correlation_order
##  [1] "shares"                       "kw_avg_avg"                  
##  [3] "LDA_03"                       "self_reference_min_shares"   
##  [5] "self_reference_avg_sharess"   "kw_max_avg"                  
##  [7] "self_reference_max_shares"    "num_hrefs"                   
##  [9] "kw_avg_max"                   "kw_max_min"                  
## [11] "kw_avg_min"                   "kw_min_avg"                  
## [13] "num_imgs"                     "global_subjectivity"         
## [15] "num_videos"                   "num_keywords"                
## [17] "abs_title_sentiment_polarity" "title_subjectivity"          
## [19] "n_tokens_title"               "title_sentiment_polarity"    
## [21] "avg_positive_polarity"        "max_positive_polarity"       
## [23] "n_tokens_content"             "kw_min_max"                  
## [25] "abs_title_subjectivity"       "kw_max_max"                  
## [27] "global_rate_negative_words"   "global_sentiment_polarity"   
## [29] "kw_min_min"                   "global_rate_positive_words"  
## [31] "n_unique_tokens"              "num_self_hrefs"              
## [33] "n_non_stop_words"             "n_non_stop_unique_tokens"    
## [35] "min_positive_polarity"        "LDA_00"                      
## [37] "rate_negative_words"          "LDA_01"                      
## [39] "rate_positive_words"          "LDA_04"                      
## [41] "min_negative_polarity"        "max_negative_polarity"       
## [43] "average_token_length"         "avg_negative_polarity"       
## [45] "LDA_02"
corrplot.mixed(shares_correlation[correlation_order, correlation_order],
               upper = "square", lower = "number",
               tl.col="black", tl.pos = "lt")  

The full plot is not really readable, and the correlations with the outcome variable do not seem very strong, so let's restrict it to the first 30 variables.

corrplot.mixed(shares_correlation[correlation_order[1:30],
                                   correlation_order[1:30]],
               upper = "square",
               lower = "number",
               tl.col="black", # color of labels (variable names)
               tl.pos = "lt")

Let's now look at the relationship between the three predictors most strongly correlated with the outcome variable, i.e. shares. The scatterplots below suggest a positive, although rather weak, linear relationship with log(shares).

ggplot(train_data,
       aes(x = kw_avg_avg,
           y = log(shares))) +
  geom_point(col = "blue") +
  geom_smooth(method = "lm", se = FALSE) +
  theme_bw() +
  labs(title = "Scatter plot of kw_avg_avg vs. log(shares)")+
  theme(plot.title = element_text(hjust = 0.4))
## `geom_smooth()` using formula 'y ~ x'

ggplot(train_data,
       aes(x = LDA_03,
           y = log(shares))) +
  geom_point(col = "blue") +
  geom_smooth(method = "lm", se = FALSE) +
  theme_bw()+
  labs(title = "Scatter plot of LDA_03 vs. log(shares)")+
  theme(plot.title = element_text(hjust = 0.4))
## `geom_smooth()` using formula 'y ~ x'

ggplot(train_data,
       aes(x = self_reference_min_shares,
           y = log(shares))) +
  geom_point(col = "blue") +
  geom_smooth(method = "lm", se = FALSE) +
  theme_bw()+
  labs(title = "Scatter plot ofself_reference_min_shares vs. log(shares)")+
  theme(plot.title = element_text(hjust = 0.4))
## `geom_smooth()` using formula 'y ~ x'

Relationship between Categorical Predictors and the Outcome Variable

Since the target variable is quantitative and these explanatory variables are qualitative, we can use analysis of variance (ANOVA). For example, we want to check whether shares differ across the two levels (Yes and No) of the predictor weekday_is_friday. The F statistic, or its p-value, can be used to test this hypothesis:
H0: weekday_is_friday does not impact shares, i.e. average shares do not differ across its levels.
Ha: weekday_is_friday does impact shares.
The higher the F statistic (or the lower its p-value), the more strongly we reject H0. We will define a function that extracts only the p-value, apply it to all of our categorical variables, and then sort the results in decreasing order of p-value.

shares_F_anova <- function(shares_categorical){
  anova_ <- aov(train_data$shares ~
                  train_data[[shares_categorical]])
  # element [1, 5] of the ANOVA table is the p-value of the F test
  return(summary(anova_)[[1]][1, 5])
}

sapply(shares_categorical,
       shares_F_anova) %>%
  # sort the p-values in decreasing order
  # and store the result as an object
  sort(decreasing = TRUE) -> shares_anova_all_categorical

shares_anova_all_categorical
##             weekday_is_friday        data_channel_is_socmed 
##                  7.452427e-01                  3.538015e-01 
##          weekday_is_wednesday           weekday_is_thursday 
##                  2.979947e-01                  2.420819e-01 
##     data_channel_is_lifestyle             weekday_is_sunday 
##                  1.738559e-01                  1.588156e-01 
##            weekday_is_tuesday             weekday_is_monday 
##                  1.303174e-01                  1.099716e-01 
##          data_channel_is_tech           data_channel_is_bus 
##                  7.059908e-02                  3.349409e-02 
##           weekday_is_saturday                    is_weekend 
##                  1.244472e-02                  4.440307e-03 
## data_channel_is_entertainment         data_channel_is_world 
##                  3.747287e-03                  9.025154e-16

From the above p-values we can see that, apart from data_channel_is_bus, weekday_is_saturday, is_weekend, data_channel_is_entertainment and data_channel_is_world, the remaining predictors do not appear to impact the outcome variable: their p-values are greater than 0.05, so for them we fail to reject the null hypothesis that they do not affect shares. We will therefore keep only the predictors whose p-value is below 0.05.
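Instead of reading the significant names off the sorted output, they can also be picked programmatically; a small sketch (the object name significant_categorical is ours, not part of the original script):

# keep only the categorical predictors with ANOVA p-value below 0.05
significant_categorical <-
  names(shares_anova_all_categorical)[shares_anova_all_categorical < 0.05]
significant_categorical

As an illustration, the boxplot below compares log(shares) between the two levels of data_channel_is_entertainment, one of the significant predictors.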

ggplot(train_data,
       aes(x = data_channel_is_entertainment ,
           y = log(shares))) +
  geom_boxplot(fill = "blue") +
  theme_bw()+
  labs(title = "Boxplot of log(shares) vs. data_channel_is_entertainment")+
  theme(plot.title = element_text(hjust = 0.5))

Identifying Linear Combinations

findLinearCombos(data) is a very useful function for identifying linear relationships in the data: it decomposes the data matrix to find groups of variables that are linear combinations of other variables. We treat such variables as redundant, because they can be determined from the others and therefore bring nothing to the analysis. Let's check the numeric variables.

( findLinearCombos(train_data[, shares_numerical]) ->
    shares_linearCombos )
## $linearCombos
## list()
## 
## $remove
## NULL

It returns NULL, which suggests there are no linear combinations among the numeric variables in our training data.
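For comparison, here is a minimal synthetic sketch (names and data made up) of what findLinearCombos() reports when an exact linear combination is present:

set.seed(1)
toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
toy$x3 <- toy$x1 + toy$x2   # x3 is an exact linear combination of x1 and x2
findLinearCombos(as.matrix(toy))
# $linearCombos lists the involved columns and $remove suggests dropping column 3 (x3)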

Identifying Features with Zero or Near Zero Variance

We obviously do not want features whose values are heavily concentrated in a single value, i.e. features with zero or near zero variance. To identify such variables we can use the function nearZeroVar(data, saveMetrics = FALSE) from the caret package.

nearZeroVar(train_data,
            saveMetrics = TRUE) -> shares_nzv_stats

shares_nzv_stats %>%
  # we add rownames of the frame
  # (with names of variables)
  # as a new column in the data
  rownames_to_column("variable") %>%
  # and sort it in decreasing order
  arrange(-zeroVar, -nzv, -freqRatio)
##                         variable  freqRatio percentUnique zeroVar   nzv
## 1                     kw_min_max  44.906015   3.390746613   FALSE  TRUE
## 2               n_non_stop_words  33.294190   0.014413376   FALSE  TRUE
## 3                     kw_min_avg 114.807692  43.110406457   FALSE FALSE
## 4      global_sentiment_polarity  37.409091  89.031421159   FALSE FALSE
## 5            global_subjectivity  33.833333  88.732343615   FALSE FALSE
## 6     global_rate_negative_words  30.368421  30.300518882   FALSE FALSE
## 7      data_channel_is_lifestyle  17.713419   0.007206688   FALSE FALSE
## 8         data_channel_is_socmed  16.205208   0.007206688   FALSE FALSE
## 9            weekday_is_saturday  14.976972   0.007206688   FALSE FALSE
## 10          average_token_length  13.483333  80.210435284   FALSE FALSE
## 11             weekday_is_sunday  13.261048   0.007206688   FALSE FALSE
## 12    self_reference_avg_sharess  12.080488  23.807293168   FALSE FALSE
## 13              n_tokens_content  11.724638   7.909339867   FALSE FALSE
## 14                    kw_avg_avg   9.833333  99.243297780   FALSE FALSE
## 15    global_rate_positive_words   9.511364  38.173825310   FALSE FALSE
## 16      title_sentiment_polarity   9.460285   2.489910637   FALSE FALSE
## 17     self_reference_max_shares   9.089908   3.823147881   FALSE FALSE
## 18                    kw_avg_min   7.934426  49.409051600   FALSE FALSE
## 19        abs_title_subjectivity   7.821311   1.650331508   FALSE FALSE
## 20               n_unique_tokens   7.632075  61.833381378   FALSE FALSE
## 21                    kw_max_max   7.598422   0.093686941   FALSE FALSE
## 22  abs_title_sentiment_polarity   7.565147   2.014269242   FALSE FALSE
## 23            title_subjectivity   6.891197   2.082732776   FALSE FALSE
## 24                    is_weekend   6.535162   0.007206688   FALSE FALSE
## 25             weekday_is_friday   6.043655   0.007206688   FALSE FALSE
## 26         avg_positive_polarity   5.978571  71.908330931   FALSE FALSE
## 27      n_non_stop_unique_tokens   5.905109  47.171375036   FALSE FALSE
## 28           data_channel_is_bus   5.249043   0.007206688   FALSE FALSE
## 29             weekday_is_monday   4.965606   0.007206688   FALSE FALSE
## 30         avg_negative_polarity   4.835196  38.105361776   FALSE FALSE
## 31 data_channel_is_entertainment   4.616677   0.007206688   FALSE FALSE
## 32           weekday_is_thursday   4.473767   0.007206688   FALSE FALSE
## 33          data_channel_is_tech   4.451188   0.007206688   FALSE FALSE
## 34          weekday_is_wednesday   4.337950   0.007206688   FALSE FALSE
## 35            weekday_is_tuesday   4.328725   0.007206688   FALSE FALSE
## 36     self_reference_min_shares   4.076543   4.284375901   FALSE FALSE
## 37         data_channel_is_world   3.683091   0.007206688   FALSE FALSE
## 38                    num_videos   2.654242   0.187373883   FALSE FALSE
## 39                      num_imgs   2.600328   0.320697607   FALSE FALSE
## 40                        LDA_04   2.400000  72.427212453   FALSE FALSE
## 41         max_positive_polarity   2.241922   0.136927068   FALSE FALSE
## 42         min_positive_polarity   2.067956   0.115307005   FALSE FALSE
## 43                    kw_min_min   1.928623   0.064860190   FALSE FALSE
## 44                    kw_max_avg   1.735294  54.435716345   FALSE FALSE
## 45           rate_negative_words   1.492241   7.087777457   FALSE FALSE
## 46                        LDA_00   1.341463  75.583741712   FALSE FALSE
## 47                    kw_avg_max   1.340909  82.131017584   FALSE FALSE
## 48                    kw_max_min   1.300000   3.761891035   FALSE FALSE
## 49           rate_positive_words   1.262242   7.087777457   FALSE FALSE
## 50         min_negative_polarity   1.258532   0.190977227   FALSE FALSE
## 51                        LDA_02   1.147059  75.972902854   FALSE FALSE
## 52                        shares   1.144587   4.788844047   FALSE FALSE
## 53         max_negative_polarity   1.131511   0.176563851   FALSE FALSE
## 54                num_self_hrefs   1.119714   0.201787259   FALSE FALSE
## 55                        LDA_03   1.115385  76.585471317   FALSE FALSE
## 56                  num_keywords   1.088715   0.036033439   FALSE FALSE
## 57                n_tokens_title   1.044948   0.072066878   FALSE FALSE
## 58                     num_hrefs   1.022993   0.450417988   FALSE FALSE
## 59                        LDA_01   1.000000  77.673681176   FALSE FALSE

Only two variables have near zero variance: kw_min_max and n_non_stop_words.
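As a sanity check, the freqRatio reported above can be reproduced by hand: it is the frequency of the most common value divided by the frequency of the second most common one, and with caret's default settings a variable is flagged when this ratio exceeds 95/5 = 19 and the percentage of unique values is below 10. A rough sketch for n_non_stop_words:

value_counts <- sort(table(train_data$n_non_stop_words), decreasing = TRUE)
value_counts[1] / value_counts[2]   # freqRatio, roughly 33.3 as in the table above
100 * length(unique(train_data$n_non_stop_words)) / nrow(train_data)   # percentUnique, roughly 0.014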

Final Selection of Variables

Let's now build the list of selected_variables by removing features with near zero variance, features that are strongly correlated with each other, and features that are only weakly correlated with the outcome variable.

# start with all

selected_variables <- names(train_data)


# Let's take shares together with the 24 numeric features most strongly
# correlated with it (the first 25 entries of correlation_order) and the
# 5 categorical variables whose ANOVA p-values are below 0.05

selected_variables <- c(correlation_order[1:25],
                        names(shares_anova_all_categorical)[10:14])


# Now let's delete the mutually correlated variables identified by the findCorrelation function

selected_variables  <-
  selected_variables [!selected_variables %in%
                         shares_crr]



(shares_variables_nzv <- nearZeroVar(train_data,
                                      names = TRUE))
## [1] "n_non_stop_words" "kw_min_max"
selected_variables  <-
  selected_variables [!selected_variables  %in%
                         shares_variables_nzv]

selected_variables
##  [1] "shares"                        "LDA_03"                       
##  [3] "self_reference_avg_sharess"    "kw_max_avg"                   
##  [5] "num_hrefs"                     "kw_avg_max"                   
##  [7] "kw_avg_min"                    "kw_min_avg"                   
##  [9] "num_imgs"                      "global_subjectivity"          
## [11] "num_videos"                    "num_keywords"                 
## [13] "abs_title_sentiment_polarity"  "title_subjectivity"           
## [15] "n_tokens_title"                "title_sentiment_polarity"     
## [17] "avg_positive_polarity"         "max_positive_polarity"        
## [19] "n_tokens_content"              "abs_title_subjectivity"       
## [21] "data_channel_is_bus"           "weekday_is_saturday"          
## [23] "is_weekend"                    "data_channel_is_entertainment"
## [25] "data_channel_is_world"
shares_lm1 <- lm(log(shares) ~ ., 
                 data = train_data %>% 
                   dplyr::select(all_of(selected_variables))) 

vifs <- ols_vif_tol(shares_lm1)
# show tolerance and VIF for the numeric predictors only
vifs[which(vifs$Variables %in% shares_numerical),]
##                       Variables Tolerance      VIF
## 1                        LDA_03 0.6078630 1.645107
## 2    self_reference_avg_sharess 0.9449376 1.058271
## 3                    kw_max_avg 0.6437783 1.553330
## 4                     num_hrefs 0.7221733 1.384709
## 5                    kw_avg_max 0.5949289 1.680873
## 6                    kw_avg_min 0.7028196 1.422840
## 7                    kw_min_avg 0.7731303 1.293443
## 8                      num_imgs 0.7274738 1.374620
## 9           global_subjectivity 0.5662949 1.765864
## 10                   num_videos 0.8453049 1.183005
## 11                 num_keywords 0.7772105 1.286653
## 12 abs_title_sentiment_polarity 0.4203891 2.378749
## 13           title_subjectivity 0.4235306 2.361104
## 14               n_tokens_title 0.9275080 1.078158
## 15     title_sentiment_polarity 0.7973968 1.254081
## 16        avg_positive_polarity 0.3823022 2.615731
## 17        max_positive_polarity 0.3893029 2.568694
## 18             n_tokens_content 0.5896351 1.695964
## 19       abs_title_subjectivity 0.7196300 1.389603
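For reference, the VIF shown above is simply the reciprocal of the tolerance, VIF_j = 1/(1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors; a quick consistency check:

# VIF and tolerance should be exact reciprocals of each other
all.equal(vifs$VIF, 1 / vifs$Tolerance)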

In the above output we can see the variance inflation factor (VIF) for each numeric predictor. All values are low (below 3), so multicollinearity does not appear to be a problem. Now let's apply the backward elimination method to identify redundant features and remove them.

shares_lm1_backward_p <- ols_step_backward_p(shares_lm1,
                    prem = 0.05,
                    progress = F)

summary(shares_lm1_backward_p$model)
## 
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")), 
##     data = l)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1734 -0.5658 -0.1768  0.4129  5.8924 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       7.058e+00  4.620e-02 152.789  < 2e-16 ***
## LDA_03                            1.500e-01  2.129e-02   7.045 1.90e-12 ***
## self_reference_avg_sharess        2.579e-06  2.396e-07  10.762  < 2e-16 ***
## kw_max_avg                        9.629e-06  9.816e-07   9.809  < 2e-16 ***
## num_hrefs                         4.359e-03  5.459e-04   7.984 1.47e-15 ***
## kw_min_avg                        5.069e-05  5.031e-06  10.075  < 2e-16 ***
## num_imgs                          2.526e-03  7.256e-04   3.481 0.000500 ***
## global_subjectivity               3.610e-01  6.029e-02   5.987 2.16e-09 ***
## num_keywords                      1.835e-02  3.030e-03   6.055 1.42e-09 ***
## title_subjectivity                8.892e-02  1.937e-02   4.591 4.43e-06 ***
## n_tokens_title                    6.041e-03  2.583e-03   2.338 0.019372 *  
## title_sentiment_polarity          8.104e-02  2.095e-02   3.869 0.000109 ***
## avg_positive_polarity            -3.490e-01  6.617e-02  -5.275 1.34e-07 ***
## n_tokens_content                  3.050e-05  1.338e-05   2.280 0.022633 *  
## abs_title_subjectivity            1.437e-01  3.318e-02   4.332 1.48e-05 ***
## data_channel_is_busYes           -1.568e-01  1.661e-02  -9.437  < 2e-16 ***
## is_weekendYes                     2.680e-01  1.578e-02  16.985  < 2e-16 ***
## data_channel_is_entertainmentYes -3.538e-01  1.540e-02 -22.978  < 2e-16 ***
## data_channel_is_worldYes         -3.768e-01  1.520e-02 -24.790  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8843 on 27733 degrees of freedom
## Multiple R-squared:  0.09147,    Adjusted R-squared:  0.09088 
## F-statistic: 155.1 on 18 and 27733 DF,  p-value: < 2.2e-16
shares_lm1_backward_p$removed
## [1] "weekday_is_saturday"          "max_positive_polarity"       
## [3] "kw_avg_min"                   "num_videos"                  
## [5] "abs_title_sentiment_polarity" "kw_avg_max"

At the end we obtain the list of variables suggested for removal by the backward elimination method. Let's remove these features and store the remaining ones as selected_final_variables, to be used for the regression models.

selected_final_variables <- selected_variables[-which(selected_variables %in% 
                              c("weekday_is_saturday", "max_positive_polarity", "kw_avg_min", "num_videos","abs_title_sentiment_polarity", "kw_avg_max"))]



shares_lm2 <- lm(log(shares) ~ ., 
                data = train_data %>% 
                  dplyr::select(all_of(selected_final_variables))) 

summary(shares_lm2)
## 
## Call:
## lm(formula = log(shares) ~ ., data = train_data %>% dplyr::select(all_of(selected_final_variables)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1734 -0.5658 -0.1768  0.4129  5.8924 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       7.058e+00  4.620e-02 152.789  < 2e-16 ***
## LDA_03                            1.500e-01  2.129e-02   7.045 1.90e-12 ***
## self_reference_avg_sharess        2.579e-06  2.396e-07  10.762  < 2e-16 ***
## kw_max_avg                        9.629e-06  9.816e-07   9.809  < 2e-16 ***
## num_hrefs                         4.359e-03  5.459e-04   7.984 1.47e-15 ***
## kw_min_avg                        5.069e-05  5.031e-06  10.075  < 2e-16 ***
## num_imgs                          2.526e-03  7.256e-04   3.481 0.000500 ***
## global_subjectivity               3.610e-01  6.029e-02   5.987 2.16e-09 ***
## num_keywords                      1.835e-02  3.030e-03   6.055 1.42e-09 ***
## title_subjectivity                8.892e-02  1.937e-02   4.591 4.43e-06 ***
## n_tokens_title                    6.041e-03  2.583e-03   2.338 0.019372 *  
## title_sentiment_polarity          8.104e-02  2.095e-02   3.869 0.000109 ***
## avg_positive_polarity            -3.490e-01  6.617e-02  -5.275 1.34e-07 ***
## n_tokens_content                  3.050e-05  1.338e-05   2.280 0.022633 *  
## abs_title_subjectivity            1.437e-01  3.318e-02   4.332 1.48e-05 ***
## data_channel_is_busYes           -1.568e-01  1.661e-02  -9.437  < 2e-16 ***
## is_weekendYes                     2.680e-01  1.578e-02  16.985  < 2e-16 ***
## data_channel_is_entertainmentYes -3.538e-01  1.540e-02 -22.978  < 2e-16 ***
## data_channel_is_worldYes         -3.768e-01  1.520e-02 -24.790  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8843 on 27733 degrees of freedom
## Multiple R-squared:  0.09147,    Adjusted R-squared:  0.09088 
## F-statistic: 155.1 on 18 and 27733 DF,  p-value: < 2.2e-16

The above output shows the regression estimates. All of our selected_final_variables have a statistically significant relationship with the outcome variable (p-values below 0.05).
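To complement the significance tests, one could also inspect interval estimates for the coefficients; a quick sketch:

# 95% confidence intervals for the coefficients of the final linear model
confint(shares_lm2, level = 0.95)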

shares_predicted <- predict(shares_lm2)

ggplot(data.frame(error = log(train_data$shares) - shares_predicted),
       aes(x = error)) +
  geom_histogram(fill = "blue", bins = 100) +
  theme_bw() + 
  labs(title = "Distribution of errors")+
  theme(plot.title = element_text(hjust = 0.5))

The above histogram presents the distribution of errors. One can see that it is slightly right-skewed.
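As a rough numeric check, the skewness of the errors can be computed directly in base R (this snippet is our own addition, not part of the original analysis):

errors <- log(train_data$shares) - shares_predicted
# moment skewness; a positive value indicates a right-skewed distribution
mean((errors - mean(errors))^3) / sd(errors)^3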

KNN

#applying knn

different_k <- data.frame(k = seq(1, 99, 4))

# 5-fold cross-validation
trControl5 <- trainControl(method = 'cv',
                           number = 5)

set.seed(987654321)

shares_train_knn <-
  train(log(shares) ~ .,
        data = train_data %>%
          dplyr::select(all_of(selected_final_variables)),
        # model type - now knn!!
        method = "knn",
        # train control
        trControl = trControl5,
        # we give the parameter(s)
        # required by the model
        tuneGrid = different_k)

shares_train_knn
## k-Nearest Neighbors 
## 
## 27752 samples
##    18 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 22201, 22202, 22200, 22202, 22203 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared    MAE      
##    1  1.2245488  0.01618454  0.9136462
##    5  0.9510744  0.04707808  0.7153875
##    9  0.9202472  0.05854906  0.6912398
##   13  0.9063531  0.06809598  0.6797068
##   17  0.8993892  0.07407287  0.6744863
##   21  0.8952353  0.07795513  0.6708487
##   25  0.8935680  0.07919309  0.6689866
##   29  0.8916505  0.08129036  0.6672542
##   33  0.8898324  0.08372251  0.6656412
##   37  0.8890289  0.08456773  0.6648607
##   41  0.8883209  0.08543991  0.6644468
##   45  0.8874559  0.08667781  0.6638042
##   49  0.8869799  0.08731341  0.6633693
##   53  0.8861999  0.08854308  0.6627953
##   57  0.8859525  0.08883439  0.6626536
##   61  0.8858113  0.08895102  0.6627494
##   65  0.8855681  0.08930195  0.6623729
##   69  0.8853146  0.08967885  0.6623255
##   73  0.8852310  0.08975513  0.6623045
##   77  0.8849628  0.09018560  0.6621098
##   81  0.8847215  0.09058359  0.6618570
##   85  0.8844259  0.09110720  0.6616450
##   89  0.8840423  0.09179092  0.6614116
##   93  0.8840582  0.09172586  0.6614094
##   97  0.8837640  0.09226801  0.6612176
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 97.
source("F_regression_metrics.R")
plot(shares_train_knn)

The above graph shows that the best k for our data is k = 97, so the final model is a 97-nearest-neighbour regression. As the metrics below show, the model performs worse on the test data than on the training data, which suggests some overfitting to the learning sample.
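The chosen k and its cross-validated performance can also be read directly from the caret object (a quick check; the same numbers appear in the resampling table above):

shares_train_knn$bestTune

shares_train_knn$results %>%
  dplyr::filter(k == shares_train_knn$bestTune$k)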

regressionMetrics(real = log(train_data$shares),
                  predicted = predict(shares_train_knn, 
                                      train_data[,selected_final_variables]))
##         MSE      RMSE       MAE     MedAE        MSLE        R2
## 1 0.7647683 0.8745103 0.6544728 0.5107568 0.009982268 0.1108046
regressionMetrics(real = log(test_data$shares),
                  predicted = predict(shares_train_knn, 
                                      test_data[,selected_final_variables]))
##         MSE      RMSE       MAE     MedAE       MSLE         R2
## 1 0.8039636 0.8966402 0.6680053 0.5158331 0.01083345 0.08548314

Ridge Regression

#lets apply ridge regression

shares_ridge <- glmnet(# x needs to be a matrix here
  x = as.matrix(train_data[, shares_numerical]),
  y = train_data$shares,
  family = "gaussian", # if not provided, the function guesses it
  # based on the distribution of the variable y
  alpha = 0) #for ridge regression

plot(shares_ridge)

In the default glmnet plot the coefficients are drawn against the L1 norm of the coefficient vector, so they grow as the penalty is relaxed; equivalently, as lambda increases the ridge coefficients shrink towards zero. This is exactly what ridge regression does: it minimizes the residual sum of squares plus lambda times the sum of squared coefficients, so a larger lambda means stronger shrinkage. The same paths can be drawn against log(lambda) directly, as sketched below.
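A minimal sketch using the glmnet plotting method (plot.glmnet with xvar = "lambda"); the labels mark the individual predictors:

# coefficient paths against log(lambda): every coefficient is pulled
# towards zero as the penalty grows
plot(shares_ridge, xvar = "lambda", label = TRUE)

Let's now tune lambda by cross-validation, keeping alpha equal to zero as required for ridge regression.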

ctrl_cv5 <- trainControl(method = "cv", number = 5)

lambdas <- exp(log(10)*seq(-2, 9, length.out = 200))


parameters_ridge <- expand.grid(alpha = 0, lambda = lambdas)

set.seed(123456789)
shares_ridge <- train(log(shares) ~ .,
                      data = train_data %>% 
                        dplyr::select(all_of(selected_final_variables)),
                      method = "glmnet", 
                      tuneGrid = parameters_ridge,
                      trControl = ctrl_cv5)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
shares_ridge
## glmnet 
## 
## 27752 samples
##    18 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 22202, 22202, 22199, 22202, 22203 
## Resampling results across tuning parameters:
## 
##   lambda        RMSE       Rsquared    MAE      
##   1.000000e-02  0.8850103  0.08928777  0.6605707
##   1.135733e-02  0.8850103  0.08928777  0.6605707
##   1.289890e-02  0.8850103  0.08928777  0.6605707
##   1.464971e-02  0.8850101  0.08928795  0.6605766
##   1.663817e-02  0.8850110  0.08928678  0.6606038
##   1.889652e-02  0.8850125  0.08928552  0.6606356
##   2.146141e-02  0.8850150  0.08928327  0.6606720
##   2.437444e-02  0.8850187  0.08928019  0.6607134
##   2.768287e-02  0.8850240  0.08927606  0.6607607
##   3.144035e-02  0.8850315  0.08927058  0.6608154
##   3.570786e-02  0.8850416  0.08926335  0.6608780
##   4.055461e-02  0.8850552  0.08925387  0.6609494
##   4.605922e-02  0.8850731  0.08924204  0.6610320
##   5.231099e-02  0.8850964  0.08922686  0.6611276
##   5.941134e-02  0.8851269  0.08920725  0.6612377
##   6.747544e-02  0.8851659  0.08918237  0.6613642
##   7.663411e-02  0.8852157  0.08915105  0.6615115
##   8.703591e-02  0.8852783  0.08911220  0.6616801
##   9.884959e-02  0.8853569  0.08906390  0.6618717
##   1.122668e-01  0.8854545  0.08900461  0.6620892
##   1.275051e-01  0.8855745  0.08893250  0.6623355
##   1.448118e-01  0.8857216  0.08884477  0.6626184
##   1.644676e-01  0.8858996  0.08874006  0.6629417
##   1.867914e-01  0.8861139  0.08861533  0.6633081
##   2.121452e-01  0.8863703  0.08846733  0.6637263
##   2.409404e-01  0.8866726  0.08829603  0.6641942
##   2.736440e-01  0.8870284  0.08809650  0.6647164
##   3.107866e-01  0.8874432  0.08786708  0.6652971
##   3.529707e-01  0.8879204  0.08760835  0.6659375
##   4.008806e-01  0.8884687  0.08731490  0.6666456
##   4.552935e-01  0.8890905  0.08698900  0.6674185
##   5.170920e-01  0.8897894  0.08663027  0.6682622
##   5.872787e-01  0.8905724  0.08623455  0.6691752
##   6.669920e-01  0.8914345  0.08581134  0.6701485
##   7.575250e-01  0.8923821  0.08535523  0.6711937
##   8.603464e-01  0.8934158  0.08486732  0.6723118
##   9.771242e-01  0.8945222  0.08436181  0.6734693
##   1.109752e+00  0.8957098  0.08382923  0.6746762
##   1.260383e+00  0.8969680  0.08327891  0.6759269
##   1.431459e+00  0.8982820  0.08272141  0.6771992
##   1.625756e+00  0.8996559  0.08214881  0.6785054
##   1.846425e+00  0.9010663  0.08157907  0.6798339
##   2.097046e+00  0.9025075  0.08101101  0.6811771
##   2.381686e+00  0.9039740  0.08044303  0.6825214
##   2.704960e+00  0.9054322  0.07989846  0.6838359
##   3.072113e+00  0.9068907  0.07936305  0.6851355
##   3.489101e+00  0.9083336  0.07884377  0.6864085
##   3.962689e+00  0.9097336  0.07835618  0.6876382
##   4.500558e+00  0.9111020  0.07788557  0.6888406
##   5.111433e+00  0.9124179  0.07744354  0.6900009
##   5.805226e+00  0.9136715  0.07703249  0.6911043
##   6.593188e+00  0.9148700  0.07664305  0.6921523
##   7.488104e+00  0.9159908  0.07628931  0.6931237
##   8.504489e+00  0.9170453  0.07596049  0.6940336
##   9.658832e+00  0.9180335  0.07565468  0.6948806
##   1.096986e+01  0.9189382  0.07538284  0.6956542
##   1.245883e+01  0.9197796  0.07513079  0.6963742
##   1.414991e+01  0.9205529  0.07490154  0.6970374
##   1.607053e+01  0.9212540  0.07469795  0.6976383
##   1.825183e+01  0.9218980  0.07451078  0.6981902
##   2.072922e+01  0.9224790  0.07434457  0.6986872
##   2.354286e+01  0.9230046  0.07419554  0.6991350
##   2.673842e+01  0.9234819  0.07405985  0.6995408
##   3.036771e+01  0.9239059  0.07394194  0.6999020
##   3.448962e+01  0.9242891  0.07383511  0.7002288
##   3.917101e+01  0.9246337  0.07373917  0.7005228
##   4.448783e+01  0.9249382  0.07365582  0.7007827
##   5.052631e+01  0.9252124  0.07358023  0.7010170
##   5.738442e+01  0.9254563  0.07351337  0.7012256
##   6.517340e+01  0.9256722  0.07345477  0.7014102
##   7.401960e+01  0.9258657  0.07340181  0.7015756
##   8.406653e+01  0.9260359  0.07335591  0.7017208
##   9.547716e+01  0.9261874  0.07331497  0.7018497
##   1.084366e+02  0.9263225  0.07327820  0.7019645
##   1.231551e+02  0.9264405  0.07324668  0.7020645
##   1.398713e+02  0.9271491  0.07209037  0.7026629
##   1.588565e+02  0.9273364         NaN  0.7028197
##   1.804186e+02  0.9273364         NaN  0.7028197
##   2.049075e+02  0.9273364         NaN  0.7028197
##   2.327202e+02  0.9273364         NaN  0.7028197
##   2.643081e+02  0.9273364         NaN  0.7028197
##   3.001836e+02  0.9273364         NaN  0.7028197
##   3.409285e+02  0.9273364         NaN  0.7028197
##   3.872039e+02  0.9273364         NaN  0.7028197
##   4.397604e+02  0.9273364         NaN  0.7028197
##   4.994505e+02  0.9273364         NaN  0.7028197
##   5.672426e+02  0.9273364         NaN  0.7028197
##   6.442364e+02  0.9273364         NaN  0.7028197
##   7.316807e+02  0.9273364         NaN  0.7028197
##   8.309942e+02  0.9273364         NaN  0.7028197
##   9.437878e+02  0.9273364         NaN  0.7028197
##   1.071891e+03  0.9273364         NaN  0.7028197
##   1.217383e+03  0.9273364         NaN  0.7028197
##   1.382622e+03  0.9273364         NaN  0.7028197
##   1.570290e+03  0.9273364         NaN  0.7028197
##   1.783431e+03  0.9273364         NaN  0.7028197
##   2.025502e+03  0.9273364         NaN  0.7028197
##   2.300430e+03  0.9273364         NaN  0.7028197
##   2.612675e+03  0.9273364         NaN  0.7028197
##   2.967302e+03  0.9273364         NaN  0.7028197
##   3.370064e+03  0.9273364         NaN  0.7028197
##   3.827494e+03  0.9273364         NaN  0.7028197
##   4.347013e+03  0.9273364         NaN  0.7028197
##   4.937048e+03  0.9273364         NaN  0.7028197
##   5.607170e+03  0.9273364         NaN  0.7028197
##   6.368250e+03  0.9273364         NaN  0.7028197
##   7.232634e+03  0.9273364         NaN  0.7028197
##   8.214344e+03  0.9273364         NaN  0.7028197
##   9.329304e+03  0.9273364         NaN  0.7028197
##   1.059560e+04  0.9273364         NaN  0.7028197
##   1.203378e+04  0.9273364         NaN  0.7028197
##   1.366716e+04  0.9273364         NaN  0.7028197
##   1.552225e+04  0.9273364         NaN  0.7028197
##   1.762914e+04  0.9273364         NaN  0.7028197
##   2.002200e+04  0.9273364         NaN  0.7028197
##   2.273966e+04  0.9273364         NaN  0.7028197
##   2.582619e+04  0.9273364         NaN  0.7028197
##   2.933166e+04  0.9273364         NaN  0.7028197
##   3.331295e+04  0.9273364         NaN  0.7028197
##   3.783463e+04  0.9273364         NaN  0.7028197
##   4.297005e+04  0.9273364         NaN  0.7028197
##   4.880252e+04  0.9273364         NaN  0.7028197
##   5.542665e+04  0.9273364         NaN  0.7028197
##   6.294989e+04  0.9273364         NaN  0.7028197
##   7.149429e+04  0.9273364         NaN  0.7028197
##   8.119845e+04  0.9273364         NaN  0.7028197
##   9.221979e+04  0.9273364         NaN  0.7028197
##   1.047371e+05  0.9273364         NaN  0.7028197
##   1.189534e+05  0.9273364         NaN  0.7028197
##   1.350994e+05  0.9273364         NaN  0.7028197
##   1.534368e+05  0.9273364         NaN  0.7028197
##   1.742633e+05  0.9273364         NaN  0.7028197
##   1.979167e+05  0.9273364         NaN  0.7028197
##   2.247806e+05  0.9273364         NaN  0.7028197
##   2.552908e+05  0.9273364         NaN  0.7028197
##   2.899423e+05  0.9273364         NaN  0.7028197
##   3.292971e+05  0.9273364         NaN  0.7028197
##   3.739937e+05  0.9273364         NaN  0.7028197
##   4.247572e+05  0.9273364         NaN  0.7028197
##   4.824109e+05  0.9273364         NaN  0.7028197
##   5.478901e+05  0.9273364         NaN  0.7028197
##   6.222571e+05  0.9273364         NaN  0.7028197
##   7.067181e+05  0.9273364         NaN  0.7028197
##   8.026434e+05  0.9273364         NaN  0.7028197
##   9.115888e+05  0.9273364         NaN  0.7028197
##   1.035322e+06  0.9273364         NaN  0.7028197
##   1.175850e+06  0.9273364         NaN  0.7028197
##   1.335452e+06  0.9273364         NaN  0.7028197
##   1.516717e+06  0.9273364         NaN  0.7028197
##   1.722586e+06  0.9273364         NaN  0.7028197
##   1.956398e+06  0.9273364         NaN  0.7028197
##   2.221947e+06  0.9273364         NaN  0.7028197
##   2.523539e+06  0.9273364         NaN  0.7028197
##   2.866068e+06  0.9273364         NaN  0.7028197
##   3.255089e+06  0.9273364         NaN  0.7028197
##   3.696913e+06  0.9273364         NaN  0.7028197
##   4.198707e+06  0.9273364         NaN  0.7028197
##   4.768612e+06  0.9273364         NaN  0.7028197
##   5.415871e+06  0.9273364         NaN  0.7028197
##   6.150986e+06  0.9273364         NaN  0.7028197
##   6.985880e+06  0.9273364         NaN  0.7028197
##   7.934097e+06  0.9273364         NaN  0.7028197
##   9.011018e+06  0.9273364         NaN  0.7028197
##   1.023411e+07  0.9273364         NaN  0.7028197
##   1.162322e+07  0.9273364         NaN  0.7028197
##   1.320088e+07  0.9273364         NaN  0.7028197
##   1.499268e+07  0.9273364         NaN  0.7028197
##   1.702769e+07  0.9273364         NaN  0.7028197
##   1.933892e+07  0.9273364         NaN  0.7028197
##   2.196385e+07  0.9273364         NaN  0.7028197
##   2.494508e+07  0.9273364         NaN  0.7028197
##   2.833096e+07  0.9273364         NaN  0.7028197
##   3.217642e+07  0.9273364         NaN  0.7028197
##   3.654383e+07  0.9273364         NaN  0.7028197
##   4.150405e+07  0.9273364         NaN  0.7028197
##   4.713753e+07  0.9273364         NaN  0.7028197
##   5.353567e+07  0.9273364         NaN  0.7028197
##   6.080224e+07  0.9273364         NaN  0.7028197
##   6.905514e+07  0.9273364         NaN  0.7028197
##   7.842822e+07  0.9273364         NaN  0.7028197
##   8.907355e+07  0.9273364         NaN  0.7028197
##   1.011638e+08  0.9273364         NaN  0.7028197
##   1.148951e+08  0.9273364         NaN  0.7028197
##   1.304902e+08  0.9273364         NaN  0.7028197
##   1.482021e+08  0.9273364         NaN  0.7028197
##   1.683180e+08  0.9273364         NaN  0.7028197
##   1.911644e+08  0.9273364         NaN  0.7028197
##   2.171118e+08  0.9273364         NaN  0.7028197
##   2.465811e+08  0.9273364         NaN  0.7028197
##   2.800504e+08  0.9273364         NaN  0.7028197
##   3.180626e+08  0.9273364         NaN  0.7028197
##   3.612343e+08  0.9273364         NaN  0.7028197
##   4.102658e+08  0.9273364         NaN  0.7028197
##   4.659526e+08  0.9273364         NaN  0.7028197
##   5.291979e+08  0.9273364         NaN  0.7028197
##   6.010277e+08  0.9273364         NaN  0.7028197
##   6.826072e+08  0.9273364         NaN  0.7028197
##   7.752597e+08  0.9273364         NaN  0.7028197
##   8.804884e+08  0.9273364         NaN  0.7028197
##   1.000000e+09  0.9273364         NaN  0.7028197
## 
## Tuning parameter 'alpha' was held constant at a value of 0
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0 and lambda = 0.01464971.
plot(shares_ridge)

In the above plot we can see that the RMSE is lowest for small values of lambda. Let's now search over a finer grid of small lambda values and estimate the best model.

parameters_ridge <- expand.grid(alpha = 0,
                                 lambda = seq(0.0001, 0.2, 0.0001))


set.seed(123456789)

shares_ridge2 <- train(log(shares) ~ .,
                      data = train_data %>% 
                        dplyr::select(all_of(selected_final_variables)),
                      method = "glmnet", 
                      tuneGrid = parameters_ridge,
                      trControl = ctrl_cv5)

#shares_ridge2

plot(shares_ridge2)

The above graph demonstrates that only small values of lambda (roughly up to 0.015) are worthwhile; beyond that point the RMSE starts to increase.

shares_ridge2$bestTune$lambda
## [1] 0.0143
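The ridge coefficients at this cross-validated lambda can be extracted from the underlying glmnet fit (a minimal sketch using the finalModel object stored by caret):

# coefficients of the final ridge model at the selected lambda
coef(shares_ridge2$finalModel, s = shares_ridge2$bestTune$lambda)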

Results and Conclusion

To compare the results of the different estimated models (OLS, KNN and ridge regression), let's compute the predicted values for each model and summarize the error measures.

shares_model_list <- list(shares_lm2 = shares_lm2,
                          shares_train_knn_ = shares_train_knn,
                          shares_ridge2 = shares_ridge2)

shares_fitted <- shares_model_list %>%
  sapply(function(x) predict(x, newdata=train_data[, selected_final_variables])) %>%
  data.frame()

sapply(shares_fitted,
       function(x) regressionMetrics(real = log(train_data$shares), predicted = x
       )) %>%  t()
##                   MSE       RMSE      MAE       MedAE     MSLE       
## shares_lm2        0.7814003 0.8839685 0.6595661 0.5170039 0.01017652 
## shares_train_knn_ 0.7647683 0.8745103 0.6544728 0.5107568 0.009982268
## shares_ridge2     0.7814237 0.8839817 0.6597824 0.5176545 0.01017712 
##                   R2        
## shares_lm2        0.0914667 
## shares_train_knn_ 0.1108046 
## shares_ridge2     0.09143946

For OLS the RMSE and R2 are 0.88 and 0.091; for KNN they are 0.87 and 0.11; and for ridge regression they are 0.88 and 0.091. On the training data, KNN therefore performs best, since it has the lowest RMSE and the highest R2 of the three models.

Let's now compute the predictions for the test data and see how the results differ.

shares_forecasts <- shares_model_list %>% 
  sapply(function(x) predict(x, newdata = test_data[,selected_final_variables])) %>% 
  data.frame()

sapply(shares_forecasts,
       function(x) regressionMetrics(log(test_data$shares), x
       )) %>%  t()
##                   MSE       RMSE      MAE       MedAE     MSLE       R2        
## shares_lm2        0.803079  0.8961467 0.6632185 0.5123149 0.01079164 0.08648938
## shares_train_knn_ 0.8039636 0.8966402 0.6680053 0.5158331 0.01083345 0.08548314
## shares_ridge2     0.8030152 0.8961112 0.6633658 0.5132054 0.01079122 0.08656191

In the above output we can see that ridge regression performs best among all the models on the test data: it has the lowest RMSE (0.8961) and the highest R2 (0.0866), although the differences between the three models are small. Therefore, we select ridge regression to predict shares on an unseen data set.
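As a final illustration, the selected ridge model could be applied to a new batch of articles. The sketch below assumes a hypothetical data frame new_articles containing the same predictor columns as selected_final_variables; since the model was fitted on log(shares), the predictions are exponentiated to return to the share scale.

# predictions on hypothetical unseen articles (new_articles is assumed
# to hold the same predictors as selected_final_variables)
new_log_shares <- predict(shares_ridge2, newdata = new_articles)

# back-transform from the log scale to the number of shares
predicted_shares <- exp(new_log_shares)
head(predicted_shares)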