In today’s digital era, most of us seek and share information through news and articles on the internet. It is easy to see which articles are most liked, viewed and trending on the internet and across social media platforms. A trending article can make its author popular in a matter of hours, and it can help media outlets attract more readers and generate profit. It is therefore worth evaluating which features make an article gain popularity. This project aims to predict the popularity of an online article using the features described below, and to select the best-performing machine learning model based on Root Mean Square Error (RMSE) and R².
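Both metrics are straightforward to compute by hand; as a point of reference, here is a minimal sketch in R (the helper names rmse and r_squared are ours, not from any package):
# Root Mean Square Error: typical size of a prediction error
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}
# R-squared: share of the outcome's variance explained by the predictions
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}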
This data set summarizes a heterogeneous set of features about articles published by Mashable in a period of two years. The goal is to predict the number of shares in social networks. Number of Attributes: 61 (58 predictive attributes, 2 non-predictive, 1 goal field)
The data set has been retrieved from the UCI Machine Learning Repository https://archive.ics.uci.edu/ml/datasets/Online%2BNews%2BPopularity. We would like to predict the number of shares (shares).
Let's take a look at the structure of our data set by applying the glimpse() function.
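The calls in this report rely on several packages; the setup chunk is not shown, so below is a reconstruction of the libraries we assume to be loaded:
library(dplyr)         # glimpse(), pipes, data manipulation
library(tibble)        # rownames_to_column()
library(ggplot2)       # plots
library(caret)         # createDataPartition(), findCorrelation(), nearZeroVar(), findLinearCombos()
library(olsrr)         # ols_vif_tol(), ols_step_backward_p()
library(bestNormalize) # boxcox(), yeojohnson()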
my_data<-read.csv("Regression.csv")
glimpse(my_data)
## Rows: 39,644
## Columns: 61
## $ url <chr> "http://mashable.com/2013/01/07/amazon-i~
## $ timedelta <int> 731, 731, 731, 731, 731, 731, 731, 731, ~
## $ n_tokens_title <int> 12, 9, 9, 9, 13, 10, 8, 12, 11, 10, 9, 1~
## $ n_tokens_content <int> 219, 255, 211, 531, 1072, 370, 960, 989,~
## $ n_unique_tokens <dbl> 0.6635945, 0.6047431, 0.5751295, 0.50378~
## $ n_non_stop_words <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ n_non_stop_unique_tokens <dbl> 0.8153846, 0.7919463, 0.6638655, 0.66563~
## $ num_hrefs <int> 4, 3, 3, 9, 19, 2, 21, 20, 2, 4, 11, 7, ~
## $ num_self_hrefs <int> 2, 1, 1, 0, 19, 2, 20, 20, 0, 1, 0, 0, 2~
## $ num_imgs <int> 1, 1, 1, 1, 20, 0, 20, 20, 0, 1, 1, 1, 1~
## $ num_videos <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2~
## $ average_token_length <dbl> 4.680365, 4.913725, 4.393365, 4.404896, ~
## $ num_keywords <int> 5, 4, 6, 7, 7, 9, 10, 9, 7, 5, 8, 7, 8, ~
## $ data_channel_is_lifestyle <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0~
## $ data_channel_is_entertainment <int> 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ data_channel_is_bus <int> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ data_channel_is_socmed <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ data_channel_is_tech <int> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0~
## $ data_channel_is_world <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0~
## $ kw_min_min <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_max_min <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_avg_min <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_min_max <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_max_max <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_avg_max <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_min_avg <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_max_avg <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_avg_avg <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ self_reference_min_shares <dbl> 496, 0, 918, 0, 545, 8500, 545, 545, 0, ~
## $ self_reference_max_shares <dbl> 496, 0, 918, 0, 16000, 8500, 16000, 1600~
## $ self_reference_avg_sharess <dbl> 496.000, 0.000, 918.000, 0.000, 3151.158~
## $ weekday_is_monday <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ weekday_is_tuesday <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ weekday_is_wednesday <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ weekday_is_thursday <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ weekday_is_friday <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ weekday_is_saturday <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ weekday_is_sunday <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ is_weekend <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ LDA_00 <dbl> 0.50033120, 0.79975569, 0.21779229, 0.02~
## $ LDA_01 <dbl> 0.37827893, 0.05004667, 0.03333446, 0.41~
## $ LDA_02 <dbl> 0.04000468, 0.05009625, 0.03335142, 0.49~
## $ LDA_03 <dbl> 0.04126265, 0.05010067, 0.03333354, 0.02~
## $ LDA_04 <dbl> 0.04012254, 0.05000071, 0.68218829, 0.02~
## $ global_subjectivity <dbl> 0.5216171, 0.3412458, 0.7022222, 0.42984~
## $ global_sentiment_polarity <dbl> 0.09256198, 0.14894781, 0.32333333, 0.10~
## $ global_rate_positive_words <dbl> 0.04566210, 0.04313725, 0.05687204, 0.04~
## $ global_rate_negative_words <dbl> 0.013698630, 0.015686275, 0.009478673, 0~
## $ rate_positive_words <dbl> 0.7692308, 0.7333333, 0.8571429, 0.66666~
## $ rate_negative_words <dbl> 0.2307692, 0.2666667, 0.1428571, 0.33333~
## $ avg_positive_polarity <dbl> 0.3786364, 0.2869146, 0.4958333, 0.38596~
## $ min_positive_polarity <dbl> 0.10000000, 0.03333333, 0.10000000, 0.13~
## $ max_positive_polarity <dbl> 0.7000000, 0.7000000, 1.0000000, 0.80000~
## $ avg_negative_polarity <dbl> -0.3500000, -0.1187500, -0.4666667, -0.3~
## $ min_negative_polarity <dbl> -0.6000000, -0.1250000, -0.8000000, -0.6~
## $ max_negative_polarity <dbl> -0.2000000, -0.1000000, -0.1333333, -0.1~
## $ title_subjectivity <dbl> 0.5000000, 0.0000000, 0.0000000, 0.00000~
## $ title_sentiment_polarity <dbl> -0.1875000, 0.0000000, 0.0000000, 0.0000~
## $ abs_title_subjectivity <dbl> 0.00000000, 0.50000000, 0.50000000, 0.50~
## $ abs_title_sentiment_polarity <dbl> 0.1875000, 0.0000000, 0.0000000, 0.00000~
## $ shares <int> 593, 711, 1500, 1200, 505, 855, 556, 891~
In total we have 39,644 observations and 61 features (columns). The first two features, url and timedelta, are non-predictive and not needed for estimating regression models. Let's remove them from our data set and save the changes.
#Removing the first two non predictive features from our data set.
my_data <- my_data[, c(-1,-2)]
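For readability, the same removal can be written with dplyr by naming the columns instead of their positions (an alternative to the line above, not an additional step):
my_data <- my_data %>% dplyr::select(-url, -timedelta)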
Our data set contains a few binary variables. For modeling purposes, and to make the data easier to read, let's convert those binary variables (0, 1) to nominal variables (Yes, No). First we will count the unique values of every feature, then filter out the variables with exactly two values and save their names as binary.list.
a <- sapply(my_data,
            function(x) unique(x) %>% length()) %>%
  sort()
binary.list <- names(a[a == 2])
binary.list
## [1] "data_channel_is_lifestyle" "data_channel_is_entertainment"
## [3] "data_channel_is_bus" "data_channel_is_socmed"
## [5] "data_channel_is_tech" "data_channel_is_world"
## [7] "weekday_is_monday" "weekday_is_tuesday"
## [9] "weekday_is_wednesday" "weekday_is_thursday"
## [11] "weekday_is_friday" "weekday_is_saturday"
## [13] "weekday_is_sunday" "is_weekend"
transform01 <- function(my_data, binary.list){
  for (i in seq_along(binary.list)) {
    my_data[, binary.list[i]] <-
      ifelse(my_data[, binary.list[i]] == 1, 'Yes', 'No')
  }
  return(my_data)
}
my_data <- transform01(my_data, binary.list)
glimpse(my_data)
## Rows: 39,644
## Columns: 59
## $ n_tokens_title <int> 12, 9, 9, 9, 13, 10, 8, 12, 11, 10, 9, 1~
## $ n_tokens_content <int> 219, 255, 211, 531, 1072, 370, 960, 989,~
## $ n_unique_tokens <dbl> 0.6635945, 0.6047431, 0.5751295, 0.50378~
## $ n_non_stop_words <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ n_non_stop_unique_tokens <dbl> 0.8153846, 0.7919463, 0.6638655, 0.66563~
## $ num_hrefs <int> 4, 3, 3, 9, 19, 2, 21, 20, 2, 4, 11, 7, ~
## $ num_self_hrefs <int> 2, 1, 1, 0, 19, 2, 20, 20, 0, 1, 0, 0, 2~
## $ num_imgs <int> 1, 1, 1, 1, 20, 0, 20, 20, 0, 1, 1, 1, 1~
## $ num_videos <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2~
## $ average_token_length <dbl> 4.680365, 4.913725, 4.393365, 4.404896, ~
## $ num_keywords <int> 5, 4, 6, 7, 7, 9, 10, 9, 7, 5, 8, 7, 8, ~
## $ data_channel_is_lifestyle <chr> "No", "No", "No", "No", "No", "No", "Yes~
## $ data_channel_is_entertainment <chr> "Yes", "No", "No", "Yes", "No", "No", "N~
## $ data_channel_is_bus <chr> "No", "Yes", "Yes", "No", "No", "No", "N~
## $ data_channel_is_socmed <chr> "No", "No", "No", "No", "No", "No", "No"~
## $ data_channel_is_tech <chr> "No", "No", "No", "No", "Yes", "Yes", "N~
## $ data_channel_is_world <chr> "No", "No", "No", "No", "No", "No", "No"~
## $ kw_min_min <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_max_min <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_avg_min <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_min_max <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_max_max <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_avg_max <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_min_avg <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_max_avg <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_avg_avg <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ self_reference_min_shares <dbl> 496, 0, 918, 0, 545, 8500, 545, 545, 0, ~
## $ self_reference_max_shares <dbl> 496, 0, 918, 0, 16000, 8500, 16000, 1600~
## $ self_reference_avg_sharess <dbl> 496.000, 0.000, 918.000, 0.000, 3151.158~
## $ weekday_is_monday <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"~
## $ weekday_is_tuesday <chr> "No", "No", "No", "No", "No", "No", "No"~
## $ weekday_is_wednesday <chr> "No", "No", "No", "No", "No", "No", "No"~
## $ weekday_is_thursday <chr> "No", "No", "No", "No", "No", "No", "No"~
## $ weekday_is_friday <chr> "No", "No", "No", "No", "No", "No", "No"~
## $ weekday_is_saturday <chr> "No", "No", "No", "No", "No", "No", "No"~
## $ weekday_is_sunday <chr> "No", "No", "No", "No", "No", "No", "No"~
## $ is_weekend <chr> "No", "No", "No", "No", "No", "No", "No"~
## $ LDA_00 <dbl> 0.50033120, 0.79975569, 0.21779229, 0.02~
## $ LDA_01 <dbl> 0.37827893, 0.05004667, 0.03333446, 0.41~
## $ LDA_02 <dbl> 0.04000468, 0.05009625, 0.03335142, 0.49~
## $ LDA_03 <dbl> 0.04126265, 0.05010067, 0.03333354, 0.02~
## $ LDA_04 <dbl> 0.04012254, 0.05000071, 0.68218829, 0.02~
## $ global_subjectivity <dbl> 0.5216171, 0.3412458, 0.7022222, 0.42984~
## $ global_sentiment_polarity <dbl> 0.09256198, 0.14894781, 0.32333333, 0.10~
## $ global_rate_positive_words <dbl> 0.04566210, 0.04313725, 0.05687204, 0.04~
## $ global_rate_negative_words <dbl> 0.013698630, 0.015686275, 0.009478673, 0~
## $ rate_positive_words <dbl> 0.7692308, 0.7333333, 0.8571429, 0.66666~
## $ rate_negative_words <dbl> 0.2307692, 0.2666667, 0.1428571, 0.33333~
## $ avg_positive_polarity <dbl> 0.3786364, 0.2869146, 0.4958333, 0.38596~
## $ min_positive_polarity <dbl> 0.10000000, 0.03333333, 0.10000000, 0.13~
## $ max_positive_polarity <dbl> 0.7000000, 0.7000000, 1.0000000, 0.80000~
## $ avg_negative_polarity <dbl> -0.3500000, -0.1187500, -0.4666667, -0.3~
## $ min_negative_polarity <dbl> -0.6000000, -0.1250000, -0.8000000, -0.6~
## $ max_negative_polarity <dbl> -0.2000000, -0.1000000, -0.1333333, -0.1~
## $ title_subjectivity <dbl> 0.5000000, 0.0000000, 0.0000000, 0.00000~
## $ title_sentiment_polarity <dbl> -0.1875000, 0.0000000, 0.0000000, 0.0000~
## $ abs_title_subjectivity <dbl> 0.00000000, 0.50000000, 0.50000000, 0.50~
## $ abs_title_sentiment_polarity <dbl> 0.1875000, 0.0000000, 0.0000000, 0.00000~
## $ shares <int> 593, 711, 1500, 1200, 505, 855, 556, 891~
Before applying any feature selection methods, it is good practice to quickly go through all the columns and check whether any of them contain missing values.
colSums(is.na(my_data)) %>% sort()
## n_tokens_title n_tokens_content
## 0 0
## n_unique_tokens n_non_stop_words
## 0 0
## n_non_stop_unique_tokens num_hrefs
## 0 0
## num_self_hrefs num_imgs
## 0 0
## num_videos average_token_length
## 0 0
## num_keywords data_channel_is_lifestyle
## 0 0
## data_channel_is_entertainment data_channel_is_bus
## 0 0
## data_channel_is_socmed data_channel_is_tech
## 0 0
## data_channel_is_world kw_min_min
## 0 0
## kw_max_min kw_avg_min
## 0 0
## kw_min_max kw_max_max
## 0 0
## kw_avg_max kw_min_avg
## 0 0
## kw_max_avg kw_avg_avg
## 0 0
## self_reference_min_shares self_reference_max_shares
## 0 0
## self_reference_avg_sharess weekday_is_monday
## 0 0
## weekday_is_tuesday weekday_is_wednesday
## 0 0
## weekday_is_thursday weekday_is_friday
## 0 0
## weekday_is_saturday weekday_is_sunday
## 0 0
## is_weekend LDA_00
## 0 0
## LDA_01 LDA_02
## 0 0
## LDA_03 LDA_04
## 0 0
## global_subjectivity global_sentiment_polarity
## 0 0
## global_rate_positive_words global_rate_negative_words
## 0 0
## rate_positive_words rate_negative_words
## 0 0
## avg_positive_polarity min_positive_polarity
## 0 0
## max_positive_polarity avg_negative_polarity
## 0 0
## min_negative_polarity max_negative_polarity
## 0 0
## title_subjectivity title_sentiment_polarity
## 0 0
## abs_title_subjectivity abs_title_sentiment_polarity
## 0 0
## shares
## 0
# no missing value in any column
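A quicker yes/no version of the same check is base R's anyNA():
anyNA(my_data)
## [1] FALSE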
Let's make a list of the categorical variables and convert those character variables to factors. We will also store the names of the numeric features in a vector named shares_numerical.
shares_categorical <- sapply(my_data, is.character) %>% which() %>% names()
#converting all categorical variables to factors by applying a loop
for (variable in shares_categorical) {
my_data[[variable]] <- as.factor(my_data[[variable]])
}
#list of numeric features in a separate vector
shares_numerical <- sapply(my_data, is.numeric) %>% which() %>% names()
glimpse(my_data)
## Rows: 39,644
## Columns: 59
## $ n_tokens_title <int> 12, 9, 9, 9, 13, 10, 8, 12, 11, 10, 9, 1~
## $ n_tokens_content <int> 219, 255, 211, 531, 1072, 370, 960, 989,~
## $ n_unique_tokens <dbl> 0.6635945, 0.6047431, 0.5751295, 0.50378~
## $ n_non_stop_words <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ n_non_stop_unique_tokens <dbl> 0.8153846, 0.7919463, 0.6638655, 0.66563~
## $ num_hrefs <int> 4, 3, 3, 9, 19, 2, 21, 20, 2, 4, 11, 7, ~
## $ num_self_hrefs <int> 2, 1, 1, 0, 19, 2, 20, 20, 0, 1, 0, 0, 2~
## $ num_imgs <int> 1, 1, 1, 1, 20, 0, 20, 20, 0, 1, 1, 1, 1~
## $ num_videos <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2~
## $ average_token_length <dbl> 4.680365, 4.913725, 4.393365, 4.404896, ~
## $ num_keywords <int> 5, 4, 6, 7, 7, 9, 10, 9, 7, 5, 8, 7, 8, ~
## $ data_channel_is_lifestyle <fct> No, No, No, No, No, No, Yes, No, No, No,~
## $ data_channel_is_entertainment <fct> Yes, No, No, Yes, No, No, No, No, No, No~
## $ data_channel_is_bus <fct> No, Yes, Yes, No, No, No, No, No, No, No~
## $ data_channel_is_socmed <fct> No, No, No, No, No, No, No, No, No, No, ~
## $ data_channel_is_tech <fct> No, No, No, No, Yes, Yes, No, Yes, Yes, ~
## $ data_channel_is_world <fct> No, No, No, No, No, No, No, No, No, Yes,~
## $ kw_min_min <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_max_min <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_avg_min <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_min_max <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_max_max <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_avg_max <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_min_avg <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_max_avg <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ kw_avg_avg <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ self_reference_min_shares <dbl> 496, 0, 918, 0, 545, 8500, 545, 545, 0, ~
## $ self_reference_max_shares <dbl> 496, 0, 918, 0, 16000, 8500, 16000, 1600~
## $ self_reference_avg_sharess <dbl> 496.000, 0.000, 918.000, 0.000, 3151.158~
## $ weekday_is_monday <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, ~
## $ weekday_is_tuesday <fct> No, No, No, No, No, No, No, No, No, No, ~
## $ weekday_is_wednesday <fct> No, No, No, No, No, No, No, No, No, No, ~
## $ weekday_is_thursday <fct> No, No, No, No, No, No, No, No, No, No, ~
## $ weekday_is_friday <fct> No, No, No, No, No, No, No, No, No, No, ~
## $ weekday_is_saturday <fct> No, No, No, No, No, No, No, No, No, No, ~
## $ weekday_is_sunday <fct> No, No, No, No, No, No, No, No, No, No, ~
## $ is_weekend <fct> No, No, No, No, No, No, No, No, No, No, ~
## $ LDA_00 <dbl> 0.50033120, 0.79975569, 0.21779229, 0.02~
## $ LDA_01 <dbl> 0.37827893, 0.05004667, 0.03333446, 0.41~
## $ LDA_02 <dbl> 0.04000468, 0.05009625, 0.03335142, 0.49~
## $ LDA_03 <dbl> 0.04126265, 0.05010067, 0.03333354, 0.02~
## $ LDA_04 <dbl> 0.04012254, 0.05000071, 0.68218829, 0.02~
## $ global_subjectivity <dbl> 0.5216171, 0.3412458, 0.7022222, 0.42984~
## $ global_sentiment_polarity <dbl> 0.09256198, 0.14894781, 0.32333333, 0.10~
## $ global_rate_positive_words <dbl> 0.04566210, 0.04313725, 0.05687204, 0.04~
## $ global_rate_negative_words <dbl> 0.013698630, 0.015686275, 0.009478673, 0~
## $ rate_positive_words <dbl> 0.7692308, 0.7333333, 0.8571429, 0.66666~
## $ rate_negative_words <dbl> 0.2307692, 0.2666667, 0.1428571, 0.33333~
## $ avg_positive_polarity <dbl> 0.3786364, 0.2869146, 0.4958333, 0.38596~
## $ min_positive_polarity <dbl> 0.10000000, 0.03333333, 0.10000000, 0.13~
## $ max_positive_polarity <dbl> 0.7000000, 0.7000000, 1.0000000, 0.80000~
## $ avg_negative_polarity <dbl> -0.3500000, -0.1187500, -0.4666667, -0.3~
## $ min_negative_polarity <dbl> -0.6000000, -0.1250000, -0.8000000, -0.6~
## $ max_negative_polarity <dbl> -0.2000000, -0.1000000, -0.1333333, -0.1~
## $ title_subjectivity <dbl> 0.5000000, 0.0000000, 0.0000000, 0.00000~
## $ title_sentiment_polarity <dbl> -0.1875000, 0.0000000, 0.0000000, 0.0000~
## $ abs_title_subjectivity <dbl> 0.00000000, 0.50000000, 0.50000000, 0.50~
## $ abs_title_sentiment_polarity <dbl> 0.1875000, 0.0000000, 0.0000000, 0.00000~
## $ shares <int> 593, 711, 1500, 1200, 505, 855, 556, 891~
#All the character variables are now converted to factor variables.
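As an aside, both the Yes/No recoding and the factor conversion can be written more compactly with dplyr's mutate() and across(); a sketch equivalent to the loops above (an alternative path from the raw 0/1 data, not a step to run in addition):
my_data <- my_data %>%
  mutate(across(all_of(binary.list), ~ ifelse(.x == 1, "Yes", "No"))) %>%
  mutate(across(where(is.character), as.factor))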
Let's divide the data into training and test samples. Several functions could do the job, but we will use the createDataPartition() function from the caret package.
set.seed(987654321)
data_which_train <- createDataPartition(my_data$shares, # target variable
# share of the training sample
p = 0.7,
# should result be a list?
list = FALSE)
The result is a vector of numbers: the indexes of the 70% of observations selected for training. createDataPartition() does not create the training or test sample itself; it only selects the row indexes, which we then use to split the data. Let's create the division and check the distribution of the target variable, the number of shares (shares), in both samples.
train_data <- my_data[data_which_train, ]
test_data <- my_data[-data_which_train, ]
summary(train_data$shares)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4 946 1400 3396 2800 843300
summary(test_data$shares)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 946 1400 3393 2800 441000
Apart from the minimum and maximum values, the statistics are very similar in the two samples. Keep in mind that any transformation that depends on a variable's distribution must be fitted on the training data only.
Let's check the distribution of our outcome variable, the number of shares (shares).
ggplot(my_data,
aes(x = shares)) +
geom_histogram(fill = "blue",
bins= 100) +
theme_bw()+
labs(title = "Histogram of Outcome variable (Shares)") +
theme(plot.title = element_text(hjust = 0.6))
As we can see, the distribution of our outcome variable is right-skewed. For modeling purposes, an approximately normal distribution is preferred. One way to reduce the skew is to take the logarithm and see if it helps.
ggplot(my_data,
aes(x = log(shares))) +
geom_histogram(fill = "red",
bins = 100) +
theme_bw()
The above histogram shows that the log transformation does help: the distribution looks considerably more symmetrical than before.
We can also try the Box-Cox and Yeo-Johnson transformation methods. Both transformations give the same lambda value. Since the distribution after the Box-Cox transformation is not really symmetrical either, we will stick with the log transformation.
shares_boxcox <- bestNormalize::boxcox(train_data$shares)
shares_boxcox$lambda
## [1] -0.2345715
sharesboxcox <- shares_boxcox$x.t
hist(sharesboxcox)
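For completeness, the Yeo-Johnson variant mentioned above can be run the same way through bestNormalize (a sketch; as noted, its lambda matches the Box-Cox one):
shares_yeojohnson <- bestNormalize::yeojohnson(train_data$shares)
shares_yeojohnson$lambda
hist(shares_yeojohnson$x.t)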
Feature selection is a vital step for machine learning algorithms: including redundant features, or features only weakly correlated with the outcome, adds nothing to the model and can reduce accuracy in classification tasks or increase error in regression tasks. It is therefore highly recommended to apply feature selection techniques to identify and remove redundant features. We start by examining the relationship of the predictors with the outcome variable. We will also check for strong relationships among the predictors themselves, i.e. collinearity, to make sure they are not strongly correlated with each other. Since we have both categorical and numeric features, we will handle them separately.
shares_correlation <-
cor(train_data[,shares_numerical],
use = "pairwise.complete.obs")
shares_correlation %>% round(digits = 2)
## n_tokens_title n_tokens_content n_unique_tokens
## n_tokens_title 1.00 0.02 -0.01
## n_tokens_content 0.02 1.00 0.00
## n_unique_tokens -0.01 0.00 1.00
## n_non_stop_words -0.01 0.02 1.00
## n_non_stop_unique_tokens -0.01 0.00 1.00
## num_hrefs -0.05 0.42 0.00
## num_self_hrefs -0.02 0.30 0.01
## num_imgs -0.01 0.34 0.03
## num_videos 0.05 0.10 0.00
## average_token_length -0.08 0.16 0.02
## num_keywords -0.01 0.08 0.00
## kw_min_min -0.11 -0.06 0.00
## kw_max_min 0.00 0.00 0.00
## kw_avg_min -0.03 0.00 0.00
## kw_min_max 0.01 -0.03 0.00
## kw_max_max 0.12 0.06 0.00
## kw_avg_max 0.11 -0.10 0.00
## kw_min_avg -0.01 -0.02 0.01
## kw_max_avg 0.01 -0.03 0.00
## kw_avg_avg 0.00 -0.08 0.00
## self_reference_min_shares 0.00 -0.03 0.00
## self_reference_max_shares 0.00 0.03 0.00
## self_reference_avg_sharess 0.00 -0.01 0.00
## LDA_00 -0.07 0.02 0.00
## LDA_01 0.06 -0.01 0.00
## LDA_02 0.04 0.08 -0.01
## LDA_03 0.04 -0.14 0.00
## LDA_04 -0.06 0.04 -0.01
## global_subjectivity -0.06 0.13 -0.01
## global_sentiment_polarity -0.08 0.02 0.00
## global_rate_positive_words -0.07 0.13 0.00
## global_rate_negative_words 0.02 0.12 0.00
## rate_positive_words -0.07 0.10 -0.01
## rate_negative_words 0.04 0.10 0.00
## avg_positive_polarity -0.05 0.13 -0.01
## min_positive_polarity -0.02 -0.26 0.01
## max_positive_polarity -0.02 0.41 -0.01
## avg_negative_polarity -0.02 -0.13 0.00
## min_negative_polarity -0.03 -0.45 0.01
## max_negative_polarity 0.01 0.23 0.00
## title_subjectivity 0.08 0.00 -0.01
## title_sentiment_polarity 0.00 0.02 0.00
## abs_title_subjectivity -0.15 0.01 -0.01
## abs_title_sentiment_polarity 0.04 0.02 0.00
## shares 0.01 0.01 0.00
## n_non_stop_words n_non_stop_unique_tokens
## n_tokens_title -0.01 -0.01
## n_tokens_content 0.02 0.00
## n_unique_tokens 1.00 1.00
## n_non_stop_words 1.00 1.00
## n_non_stop_unique_tokens 1.00 1.00
## num_hrefs 0.00 0.00
## num_self_hrefs 0.01 0.01
## num_imgs 0.03 0.02
## num_videos 0.00 0.00
## average_token_length 0.03 0.03
## num_keywords 0.00 0.00
## kw_min_min 0.00 0.00
## kw_max_min 0.00 0.00
## kw_avg_min 0.00 0.00
## kw_min_max 0.00 0.00
## kw_max_max 0.00 0.00
## kw_avg_max 0.00 0.00
## kw_min_avg 0.00 0.00
## kw_max_avg 0.00 0.00
## kw_avg_avg 0.00 -0.01
## self_reference_min_shares 0.00 0.00
## self_reference_max_shares 0.00 0.00
## self_reference_avg_sharess 0.00 0.00
## LDA_00 0.00 0.00
## LDA_01 0.00 0.00
## LDA_02 0.00 -0.01
## LDA_03 -0.01 -0.01
## LDA_04 0.00 0.00
## global_subjectivity -0.01 0.00
## global_sentiment_polarity 0.00 0.00
## global_rate_positive_words 0.00 0.00
## global_rate_negative_words 0.00 0.00
## rate_positive_words 0.00 0.00
## rate_negative_words 0.00 0.00
## avg_positive_polarity 0.00 0.00
## min_positive_polarity 0.00 0.00
## max_positive_polarity 0.00 -0.01
## avg_negative_polarity 0.00 0.00
## min_negative_polarity 0.00 0.01
## max_negative_polarity 0.00 0.00
## title_subjectivity -0.01 -0.01
## title_sentiment_polarity 0.00 0.00
## abs_title_subjectivity -0.01 -0.01
## abs_title_sentiment_polarity 0.00 -0.01
## shares 0.00 0.00
## num_hrefs num_self_hrefs num_imgs num_videos
## n_tokens_title -0.05 -0.02 -0.01 0.05
## n_tokens_content 0.42 0.30 0.34 0.10
## n_unique_tokens 0.00 0.01 0.03 0.00
## n_non_stop_words 0.00 0.01 0.03 0.00
## n_non_stop_unique_tokens 0.00 0.01 0.02 0.00
## num_hrefs 1.00 0.39 0.35 0.11
## num_self_hrefs 0.39 1.00 0.24 0.08
## num_imgs 0.35 0.24 1.00 -0.07
## num_videos 0.11 0.08 -0.07 1.00
## average_token_length 0.22 0.13 0.04 0.00
## num_keywords 0.13 0.10 0.09 -0.02
## kw_min_min -0.05 -0.02 -0.04 0.01
## kw_max_min 0.02 0.00 0.01 0.01
## kw_avg_min 0.01 0.00 0.00 -0.01
## kw_min_max -0.02 -0.03 0.00 0.04
## kw_max_max 0.06 0.00 0.05 0.00
## kw_avg_max -0.02 -0.07 0.00 0.14
## kw_min_avg 0.06 0.04 0.09 0.07
## kw_max_avg 0.07 -0.03 0.07 0.05
## kw_avg_avg 0.12 -0.02 0.15 0.12
## self_reference_min_shares -0.01 -0.03 0.01 0.00
## self_reference_max_shares 0.08 0.13 0.04 0.08
## self_reference_avg_sharess 0.03 0.02 0.02 0.04
## LDA_00 -0.03 -0.02 -0.15 -0.07
## LDA_01 -0.05 0.00 0.05 0.02
## LDA_02 -0.01 -0.09 -0.10 -0.09
## LDA_03 0.13 -0.03 0.21 0.27
## LDA_04 -0.06 0.14 -0.02 -0.13
## global_subjectivity 0.20 0.11 0.08 0.08
## global_sentiment_polarity 0.09 0.09 0.03 -0.03
## global_rate_positive_words 0.06 0.12 -0.04 0.08
## global_rate_negative_words 0.03 0.01 0.02 0.19
## rate_positive_words 0.10 0.14 -0.02 -0.05
## rate_negative_words 0.06 -0.01 0.04 0.07
## avg_positive_polarity 0.18 0.10 0.10 0.10
## min_positive_polarity -0.08 -0.08 -0.02 -0.01
## max_positive_polarity 0.28 0.19 0.16 0.13
## avg_negative_polarity -0.15 -0.06 -0.07 -0.12
## min_negative_polarity -0.26 -0.12 -0.13 -0.14
## max_negative_polarity 0.06 0.04 0.04 0.03
## title_subjectivity 0.04 -0.01 0.06 0.07
## title_sentiment_polarity 0.04 0.03 0.05 0.02
## abs_title_subjectivity 0.01 0.01 -0.01 -0.03
## abs_title_sentiment_polarity 0.06 0.00 0.07 0.06
## shares 0.04 0.00 0.03 0.02
## average_token_length num_keywords kw_min_min
## n_tokens_title -0.08 -0.01 -0.11
## n_tokens_content 0.16 0.08 -0.06
## n_unique_tokens 0.02 0.00 0.00
## n_non_stop_words 0.03 0.00 0.00
## n_non_stop_unique_tokens 0.03 0.00 0.00
## num_hrefs 0.22 0.13 -0.05
## num_self_hrefs 0.13 0.10 -0.02
## num_imgs 0.04 0.09 -0.04
## num_videos 0.00 -0.02 0.01
## average_token_length 1.00 -0.02 0.03
## num_keywords -0.02 1.00 0.00
## kw_min_min 0.03 0.00 1.00
## kw_max_min -0.01 0.08 0.01
## kw_avg_min 0.01 0.08 0.12
## kw_min_max -0.04 -0.28 -0.06
## kw_max_max -0.04 0.01 -0.85
## kw_avg_max -0.16 -0.33 -0.50
## kw_min_avg -0.08 -0.27 -0.14
## kw_max_avg -0.05 0.12 -0.08
## kw_avg_avg -0.14 0.02 -0.19
## self_reference_min_shares 0.04 0.00 -0.03
## self_reference_max_shares 0.04 0.01 -0.04
## self_reference_avg_sharess 0.04 0.01 -0.04
## LDA_00 0.08 -0.16 0.03
## LDA_01 -0.02 -0.06 0.01
## LDA_02 0.10 -0.01 -0.07
## LDA_03 -0.18 0.02 -0.01
## LDA_04 0.03 0.18 0.05
## global_subjectivity 0.59 0.04 0.04
## global_sentiment_polarity 0.18 0.08 0.08
## global_rate_positive_words 0.32 0.05 0.10
## global_rate_negative_words 0.23 -0.04 -0.01
## rate_positive_words 0.57 0.03 0.09
## rate_negative_words 0.32 -0.07 -0.05
## avg_positive_polarity 0.54 0.04 0.04
## min_positive_polarity 0.22 -0.01 0.02
## max_positive_polarity 0.47 0.07 0.02
## avg_negative_polarity -0.32 0.02 0.02
## min_negative_polarity -0.28 -0.01 0.06
## max_negative_polarity -0.19 0.03 -0.03
## title_subjectivity -0.03 0.01 0.00
## title_sentiment_polarity -0.01 0.03 0.03
## abs_title_subjectivity 0.02 0.00 0.01
## abs_title_sentiment_polarity -0.03 0.02 0.01
## shares -0.03 0.02 0.00
## kw_max_min kw_avg_min kw_min_max kw_max_max
## n_tokens_title 0.00 -0.03 0.01 0.12
## n_tokens_content 0.00 0.00 -0.03 0.06
## n_unique_tokens 0.00 0.00 0.00 0.00
## n_non_stop_words 0.00 0.00 0.00 0.00
## n_non_stop_unique_tokens 0.00 0.00 0.00 0.00
## num_hrefs 0.02 0.01 -0.02 0.06
## num_self_hrefs 0.00 0.00 -0.03 0.00
## num_imgs 0.01 0.00 0.00 0.05
## num_videos 0.01 -0.01 0.04 0.00
## average_token_length -0.01 0.01 -0.04 -0.04
## num_keywords 0.08 0.08 -0.28 0.01
## kw_min_min 0.01 0.12 -0.06 -0.85
## kw_max_min 1.00 0.94 -0.04 0.00
## kw_avg_min 0.94 1.00 -0.07 -0.11
## kw_min_max -0.04 -0.07 1.00 0.07
## kw_max_max 0.00 -0.11 0.07 1.00
## kw_avg_max -0.04 -0.14 0.42 0.56
## kw_min_avg 0.02 -0.02 0.35 0.16
## kw_max_avg 0.53 0.49 0.04 0.10
## kw_avg_avg 0.38 0.34 0.18 0.23
## self_reference_min_shares 0.02 0.01 0.01 0.03
## self_reference_max_shares 0.04 0.03 0.03 0.05
## self_reference_avg_sharess 0.03 0.02 0.02 0.05
## LDA_00 0.00 0.03 0.05 -0.03
## LDA_01 -0.01 -0.01 -0.02 -0.01
## LDA_02 -0.02 -0.04 -0.06 0.07
## LDA_03 0.03 0.01 0.08 0.02
## LDA_04 0.00 0.01 -0.06 -0.05
## global_subjectivity 0.01 0.02 -0.02 -0.04
## global_sentiment_polarity 0.01 0.03 -0.01 -0.09
## global_rate_positive_words 0.00 0.03 -0.01 -0.11
## global_rate_negative_words -0.01 -0.01 0.00 0.02
## rate_positive_words 0.00 0.03 -0.03 -0.10
## rate_negative_words -0.01 -0.02 -0.01 0.06
## avg_positive_polarity 0.01 0.02 -0.01 -0.04
## min_positive_polarity 0.01 0.01 0.00 -0.02
## max_positive_polarity 0.01 0.02 -0.02 -0.02
## avg_negative_polarity 0.00 0.00 0.00 -0.02
## min_negative_polarity 0.00 0.01 0.01 -0.06
## max_negative_polarity 0.01 0.00 -0.01 0.03
## title_subjectivity 0.01 0.00 0.03 0.00
## title_sentiment_polarity 0.01 0.01 0.01 -0.03
## abs_title_subjectivity 0.00 0.01 -0.02 -0.01
## abs_title_sentiment_polarity 0.01 0.01 0.03 -0.01
## shares 0.04 0.04 0.01 0.00
## kw_avg_max kw_min_avg kw_max_avg kw_avg_avg
## n_tokens_title 0.11 -0.01 0.01 0.00
## n_tokens_content -0.10 -0.02 -0.03 -0.08
## n_unique_tokens 0.00 0.01 0.00 0.00
## n_non_stop_words 0.00 0.00 0.00 0.00
## n_non_stop_unique_tokens 0.00 0.00 0.00 -0.01
## num_hrefs -0.02 0.06 0.07 0.12
## num_self_hrefs -0.07 0.04 -0.03 -0.02
## num_imgs 0.00 0.09 0.07 0.15
## num_videos 0.14 0.07 0.05 0.12
## average_token_length -0.16 -0.08 -0.05 -0.14
## num_keywords -0.33 -0.27 0.12 0.02
## kw_min_min -0.50 -0.14 -0.08 -0.19
## kw_max_min -0.04 0.02 0.53 0.38
## kw_avg_min -0.14 -0.02 0.49 0.34
## kw_min_max 0.42 0.35 0.04 0.18
## kw_max_max 0.56 0.16 0.10 0.23
## kw_avg_max 1.00 0.41 0.15 0.44
## kw_min_avg 0.41 1.00 0.10 0.47
## kw_max_avg 0.15 0.10 1.00 0.81
## kw_avg_avg 0.44 0.47 0.81 1.00
## self_reference_min_shares 0.05 0.05 0.10 0.12
## self_reference_max_shares 0.09 0.06 0.21 0.20
## self_reference_avg_sharess 0.09 0.06 0.19 0.19
## LDA_00 0.12 0.00 -0.02 -0.04
## LDA_01 -0.08 -0.01 0.00 0.01
## LDA_02 -0.12 -0.13 -0.11 -0.28
## LDA_03 0.30 0.19 0.22 0.45
## LDA_04 -0.23 -0.05 -0.09 -0.16
## global_subjectivity -0.06 0.02 0.04 0.07
## global_sentiment_polarity -0.08 0.00 0.01 0.02
## global_rate_positive_words -0.11 0.00 -0.01 -0.01
## global_rate_negative_words 0.01 0.02 0.01 0.03
## rate_positive_words -0.16 -0.06 -0.04 -0.10
## rate_negative_words 0.00 -0.01 -0.01 -0.03
## avg_positive_polarity -0.07 0.01 0.03 0.05
## min_positive_polarity 0.01 0.02 0.03 0.06
## max_positive_polarity -0.10 -0.01 0.01 0.00
## avg_negative_polarity -0.02 -0.02 -0.04 -0.07
## min_negative_polarity 0.01 -0.01 -0.02 -0.02
## max_negative_polarity -0.02 -0.01 -0.02 -0.04
## title_subjectivity 0.04 0.05 0.05 0.09
## title_sentiment_polarity 0.00 0.03 0.02 0.04
## abs_title_subjectivity -0.02 -0.02 -0.01 -0.02
## abs_title_sentiment_polarity 0.03 0.06 0.05 0.09
## shares 0.04 0.03 0.07 0.11
## self_reference_min_shares
## n_tokens_title 0.00
## n_tokens_content -0.03
## n_unique_tokens 0.00
## n_non_stop_words 0.00
## n_non_stop_unique_tokens 0.00
## num_hrefs -0.01
## num_self_hrefs -0.03
## num_imgs 0.01
## num_videos 0.00
## average_token_length 0.04
## num_keywords 0.00
## kw_min_min -0.03
## kw_max_min 0.02
## kw_avg_min 0.01
## kw_min_max 0.01
## kw_max_max 0.03
## kw_avg_max 0.05
## kw_min_avg 0.05
## kw_max_avg 0.10
## kw_avg_avg 0.12
## self_reference_min_shares 1.00
## self_reference_max_shares 0.45
## self_reference_avg_sharess 0.79
## LDA_00 0.00
## LDA_01 -0.01
## LDA_02 -0.05
## LDA_03 0.04
## LDA_04 0.01
## global_subjectivity 0.06
## global_sentiment_polarity 0.01
## global_rate_positive_words 0.01
## global_rate_negative_words 0.02
## rate_positive_words 0.02
## rate_negative_words 0.02
## avg_positive_polarity 0.05
## min_positive_polarity 0.04
## max_positive_polarity 0.01
## avg_negative_polarity -0.04
## min_negative_polarity -0.01
## max_negative_polarity -0.05
## title_subjectivity 0.01
## title_sentiment_polarity 0.01
## abs_title_subjectivity 0.00
## abs_title_sentiment_polarity 0.01
## shares 0.07
## self_reference_max_shares
## n_tokens_title 0.00
## n_tokens_content 0.03
## n_unique_tokens 0.00
## n_non_stop_words 0.00
## n_non_stop_unique_tokens 0.00
## num_hrefs 0.08
## num_self_hrefs 0.13
## num_imgs 0.04
## num_videos 0.08
## average_token_length 0.04
## num_keywords 0.01
## kw_min_min -0.04
## kw_max_min 0.04
## kw_avg_min 0.03
## kw_min_max 0.03
## kw_max_max 0.05
## kw_avg_max 0.09
## kw_min_avg 0.06
## kw_max_avg 0.21
## kw_avg_avg 0.20
## self_reference_min_shares 0.45
## self_reference_max_shares 1.00
## self_reference_avg_sharess 0.86
## LDA_00 0.01
## LDA_01 -0.02
## LDA_02 -0.05
## LDA_03 0.05
## LDA_04 0.00
## global_subjectivity 0.07
## global_sentiment_polarity 0.01
## global_rate_positive_words 0.02
## global_rate_negative_words 0.02
## rate_positive_words 0.02
## rate_negative_words 0.02
## avg_positive_polarity 0.05
## min_positive_polarity 0.00
## max_positive_polarity 0.04
## avg_negative_polarity -0.06
## min_negative_polarity -0.05
## max_negative_polarity -0.02
## title_subjectivity 0.02
## title_sentiment_polarity 0.00
## abs_title_subjectivity 0.00
## abs_title_sentiment_polarity 0.01
## shares 0.06
## self_reference_avg_sharess LDA_00 LDA_01 LDA_02
## n_tokens_title 0.00 -0.07 0.06 0.04
## n_tokens_content -0.01 0.02 -0.01 0.08
## n_unique_tokens 0.00 0.00 0.00 -0.01
## n_non_stop_words 0.00 0.00 0.00 0.00
## n_non_stop_unique_tokens 0.00 0.00 0.00 -0.01
## num_hrefs 0.03 -0.03 -0.05 -0.01
## num_self_hrefs 0.02 -0.02 0.00 -0.09
## num_imgs 0.02 -0.15 0.05 -0.10
## num_videos 0.04 -0.07 0.02 -0.09
## average_token_length 0.04 0.08 -0.02 0.10
## num_keywords 0.01 -0.16 -0.06 -0.01
## kw_min_min -0.04 0.03 0.01 -0.07
## kw_max_min 0.03 0.00 -0.01 -0.02
## kw_avg_min 0.02 0.03 -0.01 -0.04
## kw_min_max 0.02 0.05 -0.02 -0.06
## kw_max_max 0.05 -0.03 -0.01 0.07
## kw_avg_max 0.09 0.12 -0.08 -0.12
## kw_min_avg 0.06 0.00 -0.01 -0.13
## kw_max_avg 0.19 -0.02 0.00 -0.11
## kw_avg_avg 0.19 -0.04 0.01 -0.28
## self_reference_min_shares 0.79 0.00 -0.01 -0.05
## self_reference_max_shares 0.86 0.01 -0.02 -0.05
## self_reference_avg_sharess 1.00 0.00 -0.01 -0.05
## LDA_00 0.00 1.00 -0.18 -0.26
## LDA_01 -0.01 -0.18 1.00 -0.23
## LDA_02 -0.05 -0.26 -0.23 1.00
## LDA_03 0.05 -0.27 -0.12 -0.33
## LDA_04 0.00 -0.24 -0.25 -0.23
## global_subjectivity 0.08 -0.01 0.05 -0.17
## global_sentiment_polarity 0.01 0.11 -0.01 -0.22
## global_rate_positive_words 0.01 0.15 0.06 -0.23
## global_rate_negative_words 0.03 -0.08 0.09 0.02
## rate_positive_words 0.02 0.16 -0.01 -0.14
## rate_negative_words 0.03 -0.12 0.03 0.19
## avg_positive_polarity 0.05 0.00 0.05 -0.14
## min_positive_polarity 0.03 -0.11 0.02 -0.06
## max_positive_polarity 0.03 0.05 0.04 -0.11
## avg_negative_polarity -0.06 0.05 -0.10 0.03
## min_negative_polarity -0.03 0.04 -0.06 -0.07
## max_negative_polarity -0.04 -0.01 -0.04 0.08
## title_subjectivity 0.01 -0.05 0.06 -0.07
## title_sentiment_polarity 0.00 0.03 -0.01 -0.07
## abs_title_subjectivity 0.00 0.00 -0.07 0.05
## abs_title_sentiment_polarity 0.01 -0.03 0.04 -0.08
## shares 0.07 0.00 -0.01 -0.06
## LDA_03 LDA_04 global_subjectivity
## n_tokens_title 0.04 -0.06 -0.06
## n_tokens_content -0.14 0.04 0.13
## n_unique_tokens 0.00 -0.01 -0.01
## n_non_stop_words -0.01 0.00 -0.01
## n_non_stop_unique_tokens -0.01 0.00 0.00
## num_hrefs 0.13 -0.06 0.20
## num_self_hrefs -0.03 0.14 0.11
## num_imgs 0.21 -0.02 0.08
## num_videos 0.27 -0.13 0.08
## average_token_length -0.18 0.03 0.59
## num_keywords 0.02 0.18 0.04
## kw_min_min -0.01 0.05 0.04
## kw_max_min 0.03 0.00 0.01
## kw_avg_min 0.01 0.01 0.02
## kw_min_max 0.08 -0.06 -0.02
## kw_max_max 0.02 -0.05 -0.04
## kw_avg_max 0.30 -0.23 -0.06
## kw_min_avg 0.19 -0.05 0.02
## kw_max_avg 0.22 -0.09 0.04
## kw_avg_avg 0.45 -0.16 0.07
## self_reference_min_shares 0.04 0.01 0.06
## self_reference_max_shares 0.05 0.00 0.07
## self_reference_avg_sharess 0.05 0.00 0.08
## LDA_00 -0.27 -0.24 -0.01
## LDA_01 -0.12 -0.25 0.05
## LDA_02 -0.33 -0.23 -0.17
## LDA_03 1.00 -0.36 0.08
## LDA_04 -0.36 1.00 0.05
## global_subjectivity 0.08 0.05 1.00
## global_sentiment_polarity -0.01 0.13 0.34
## global_rate_positive_words -0.05 0.09 0.46
## global_rate_negative_words 0.07 -0.10 0.25
## rate_positive_words -0.17 0.17 0.48
## rate_negative_words 0.01 -0.11 0.13
## avg_positive_polarity 0.08 0.02 0.63
## min_positive_polarity 0.11 0.02 0.25
## max_positive_polarity -0.01 0.04 0.51
## avg_negative_polarity -0.12 0.12 -0.45
## min_negative_polarity -0.03 0.10 -0.35
## max_negative_polarity -0.08 0.04 -0.22
## title_subjectivity 0.12 -0.05 0.12
## title_sentiment_polarity 0.03 0.02 0.05
## abs_title_subjectivity -0.01 0.01 -0.01
## abs_title_sentiment_polarity 0.11 -0.04 0.09
## shares 0.08 -0.01 0.03
## global_sentiment_polarity
## n_tokens_title -0.08
## n_tokens_content 0.02
## n_unique_tokens 0.00
## n_non_stop_words 0.00
## n_non_stop_unique_tokens 0.00
## num_hrefs 0.09
## num_self_hrefs 0.09
## num_imgs 0.03
## num_videos -0.03
## average_token_length 0.18
## num_keywords 0.08
## kw_min_min 0.08
## kw_max_min 0.01
## kw_avg_min 0.03
## kw_min_max -0.01
## kw_max_max -0.09
## kw_avg_max -0.08
## kw_min_avg 0.00
## kw_max_avg 0.01
## kw_avg_avg 0.02
## self_reference_min_shares 0.01
## self_reference_max_shares 0.01
## self_reference_avg_sharess 0.01
## LDA_00 0.11
## LDA_01 -0.01
## LDA_02 -0.22
## LDA_03 -0.01
## LDA_04 0.13
## global_subjectivity 0.34
## global_sentiment_polarity 1.00
## global_rate_positive_words 0.57
## global_rate_negative_words -0.47
## rate_positive_words 0.73
## rate_negative_words -0.65
## avg_positive_polarity 0.50
## min_positive_polarity 0.09
## max_positive_polarity 0.43
## avg_negative_polarity 0.23
## min_negative_polarity 0.28
## max_negative_polarity -0.05
## title_subjectivity 0.02
## title_sentiment_polarity 0.24
## abs_title_subjectivity -0.02
## abs_title_sentiment_polarity 0.07
## shares 0.00
## global_rate_positive_words
## n_tokens_title -0.07
## n_tokens_content 0.13
## n_unique_tokens 0.00
## n_non_stop_words 0.00
## n_non_stop_unique_tokens 0.00
## num_hrefs 0.06
## num_self_hrefs 0.12
## num_imgs -0.04
## num_videos 0.08
## average_token_length 0.32
## num_keywords 0.05
## kw_min_min 0.10
## kw_max_min 0.00
## kw_avg_min 0.03
## kw_min_max -0.01
## kw_max_max -0.11
## kw_avg_max -0.11
## kw_min_avg 0.00
## kw_max_avg -0.01
## kw_avg_avg -0.01
## self_reference_min_shares 0.01
## self_reference_max_shares 0.02
## self_reference_avg_sharess 0.01
## LDA_00 0.15
## LDA_01 0.06
## LDA_02 -0.23
## LDA_03 -0.05
## LDA_04 0.09
## global_subjectivity 0.46
## global_sentiment_polarity 0.57
## global_rate_positive_words 1.00
## global_rate_negative_words 0.11
## rate_positive_words 0.63
## rate_negative_words -0.33
## avg_positive_polarity 0.33
## min_positive_polarity -0.11
## max_positive_polarity 0.48
## avg_negative_polarity -0.14
## min_negative_polarity -0.12
## max_negative_polarity -0.08
## title_subjectivity 0.11
## title_sentiment_polarity 0.14
## abs_title_subjectivity -0.13
## abs_title_sentiment_polarity 0.10
## shares 0.00
## global_rate_negative_words rate_positive_words
## n_tokens_title 0.02 -0.07
## n_tokens_content 0.12 0.10
## n_unique_tokens 0.00 -0.01
## n_non_stop_words 0.00 0.00
## n_non_stop_unique_tokens 0.00 0.00
## num_hrefs 0.03 0.10
## num_self_hrefs 0.01 0.14
## num_imgs 0.02 -0.02
## num_videos 0.19 -0.05
## average_token_length 0.23 0.57
## num_keywords -0.04 0.03
## kw_min_min -0.01 0.09
## kw_max_min -0.01 0.00
## kw_avg_min -0.01 0.03
## kw_min_max 0.00 -0.03
## kw_max_max 0.02 -0.10
## kw_avg_max 0.01 -0.16
## kw_min_avg 0.02 -0.06
## kw_max_avg 0.01 -0.04
## kw_avg_avg 0.03 -0.10
## self_reference_min_shares 0.02 0.02
## self_reference_max_shares 0.02 0.02
## self_reference_avg_sharess 0.03 0.02
## LDA_00 -0.08 0.16
## LDA_01 0.09 -0.01
## LDA_02 0.02 -0.14
## LDA_03 0.07 -0.17
## LDA_04 -0.10 0.17
## global_subjectivity 0.25 0.48
## global_sentiment_polarity -0.47 0.73
## global_rate_positive_words 0.11 0.63
## global_rate_negative_words 1.00 -0.40
## rate_positive_words -0.40 1.00
## rate_negative_words 0.78 -0.54
## avg_positive_polarity 0.19 0.41
## min_positive_polarity 0.06 0.05
## max_positive_polarity 0.18 0.45
## avg_negative_polarity -0.34 -0.04
## min_negative_polarity -0.47 0.09
## max_negative_polarity 0.07 -0.20
## title_subjectivity 0.09 -0.02
## title_sentiment_polarity -0.13 0.15
## abs_title_subjectivity -0.06 -0.02
## abs_title_sentiment_polarity 0.06 0.00
## shares 0.00 -0.01
## rate_negative_words avg_positive_polarity
## n_tokens_title 0.04 -0.05
## n_tokens_content 0.10 0.13
## n_unique_tokens 0.00 -0.01
## n_non_stop_words 0.00 0.00
## n_non_stop_unique_tokens 0.00 0.00
## num_hrefs 0.06 0.18
## num_self_hrefs -0.01 0.10
## num_imgs 0.04 0.10
## num_videos 0.07 0.10
## average_token_length 0.32 0.54
## num_keywords -0.07 0.04
## kw_min_min -0.05 0.04
## kw_max_min -0.01 0.01
## kw_avg_min -0.02 0.02
## kw_min_max -0.01 -0.01
## kw_max_max 0.06 -0.04
## kw_avg_max 0.00 -0.07
## kw_min_avg -0.01 0.01
## kw_max_avg -0.01 0.03
## kw_avg_avg -0.03 0.05
## self_reference_min_shares 0.02 0.05
## self_reference_max_shares 0.02 0.05
## self_reference_avg_sharess 0.03 0.05
## LDA_00 -0.12 0.00
## LDA_01 0.03 0.05
## LDA_02 0.19 -0.14
## LDA_03 0.01 0.08
## LDA_04 -0.11 0.02
## global_subjectivity 0.13 0.63
## global_sentiment_polarity -0.65 0.50
## global_rate_positive_words -0.33 0.33
## global_rate_negative_words 0.78 0.19
## rate_positive_words -0.54 0.41
## rate_negative_words 1.00 0.14
## avg_positive_polarity 0.14 1.00
## min_positive_polarity 0.19 0.46
## max_positive_polarity 0.03 0.70
## avg_negative_polarity -0.34 -0.27
## min_negative_polarity -0.45 -0.23
## max_negative_polarity 0.03 -0.14
## title_subjectivity 0.00 0.06
## title_sentiment_polarity -0.18 0.09
## abs_title_subjectivity 0.03 0.01
## abs_title_sentiment_polarity -0.02 0.10
## shares -0.01 0.01
## min_positive_polarity max_positive_polarity
## n_tokens_title -0.02 -0.02
## n_tokens_content -0.26 0.41
## n_unique_tokens 0.01 -0.01
## n_non_stop_words 0.00 0.00
## n_non_stop_unique_tokens 0.00 -0.01
## num_hrefs -0.08 0.28
## num_self_hrefs -0.08 0.19
## num_imgs -0.02 0.16
## num_videos -0.01 0.13
## average_token_length 0.22 0.47
## num_keywords -0.01 0.07
## kw_min_min 0.02 0.02
## kw_max_min 0.01 0.01
## kw_avg_min 0.01 0.02
## kw_min_max 0.00 -0.02
## kw_max_max -0.02 -0.02
## kw_avg_max 0.01 -0.10
## kw_min_avg 0.02 -0.01
## kw_max_avg 0.03 0.01
## kw_avg_avg 0.06 0.00
## self_reference_min_shares 0.04 0.01
## self_reference_max_shares 0.00 0.04
## self_reference_avg_sharess 0.03 0.03
## LDA_00 -0.11 0.05
## LDA_01 0.02 0.04
## LDA_02 -0.06 -0.11
## LDA_03 0.11 -0.01
## LDA_04 0.02 0.04
## global_subjectivity 0.25 0.51
## global_sentiment_polarity 0.09 0.43
## global_rate_positive_words -0.11 0.48
## global_rate_negative_words 0.06 0.18
## rate_positive_words 0.05 0.45
## rate_negative_words 0.19 0.03
## avg_positive_polarity 0.46 0.70
## min_positive_polarity 1.00 0.01
## max_positive_polarity 0.01 1.00
## avg_negative_polarity -0.07 -0.27
## min_negative_polarity 0.09 -0.36
## max_negative_polarity -0.19 0.00
## title_subjectivity 0.01 0.06
## title_sentiment_polarity 0.00 0.08
## abs_title_subjectivity 0.02 -0.02
## abs_title_sentiment_polarity 0.02 0.08
## shares 0.00 0.01
## avg_negative_polarity min_negative_polarity
## n_tokens_title -0.02 -0.03
## n_tokens_content -0.13 -0.45
## n_unique_tokens 0.00 0.01
## n_non_stop_words 0.00 0.00
## n_non_stop_unique_tokens 0.00 0.01
## num_hrefs -0.15 -0.26
## num_self_hrefs -0.06 -0.12
## num_imgs -0.07 -0.13
## num_videos -0.12 -0.14
## average_token_length -0.32 -0.28
## num_keywords 0.02 -0.01
## kw_min_min 0.02 0.06
## kw_max_min 0.00 0.00
## kw_avg_min 0.00 0.01
## kw_min_max 0.00 0.01
## kw_max_max -0.02 -0.06
## kw_avg_max -0.02 0.01
## kw_min_avg -0.02 -0.01
## kw_max_avg -0.04 -0.02
## kw_avg_avg -0.07 -0.02
## self_reference_min_shares -0.04 -0.01
## self_reference_max_shares -0.06 -0.05
## self_reference_avg_sharess -0.06 -0.03
## LDA_00 0.05 0.04
## LDA_01 -0.10 -0.06
## LDA_02 0.03 -0.07
## LDA_03 -0.12 -0.03
## LDA_04 0.12 0.10
## global_subjectivity -0.45 -0.35
## global_sentiment_polarity 0.23 0.28
## global_rate_positive_words -0.14 -0.12
## global_rate_negative_words -0.34 -0.47
## rate_positive_words -0.04 0.09
## rate_negative_words -0.34 -0.45
## avg_positive_polarity -0.27 -0.23
## min_positive_polarity -0.07 0.09
## max_positive_polarity -0.27 -0.36
## avg_negative_polarity 1.00 0.75
## min_negative_polarity 0.75 1.00
## max_negative_polarity 0.58 0.08
## title_subjectivity -0.08 -0.06
## title_sentiment_polarity 0.08 0.08
## abs_title_subjectivity -0.01 -0.01
## abs_title_sentiment_polarity -0.08 -0.06
## shares -0.03 -0.02
## max_negative_polarity title_subjectivity
## n_tokens_title 0.01 0.08
## n_tokens_content 0.23 0.00
## n_unique_tokens 0.00 -0.01
## n_non_stop_words 0.00 -0.01
## n_non_stop_unique_tokens 0.00 -0.01
## num_hrefs 0.06 0.04
## num_self_hrefs 0.04 -0.01
## num_imgs 0.04 0.06
## num_videos 0.03 0.07
## average_token_length -0.19 -0.03
## num_keywords 0.03 0.01
## kw_min_min -0.03 0.00
## kw_max_min 0.01 0.01
## kw_avg_min 0.00 0.00
## kw_min_max -0.01 0.03
## kw_max_max 0.03 0.00
## kw_avg_max -0.02 0.04
## kw_min_avg -0.01 0.05
## kw_max_avg -0.02 0.05
## kw_avg_avg -0.04 0.09
## self_reference_min_shares -0.05 0.01
## self_reference_max_shares -0.02 0.02
## self_reference_avg_sharess -0.04 0.01
## LDA_00 -0.01 -0.05
## LDA_01 -0.04 0.06
## LDA_02 0.08 -0.07
## LDA_03 -0.08 0.12
## LDA_04 0.04 -0.05
## global_subjectivity -0.22 0.12
## global_sentiment_polarity -0.05 0.02
## global_rate_positive_words -0.08 0.11
## global_rate_negative_words 0.07 0.09
## rate_positive_words -0.20 -0.02
## rate_negative_words 0.03 0.00
## avg_positive_polarity -0.14 0.06
## min_positive_polarity -0.19 0.01
## max_positive_polarity 0.00 0.06
## avg_negative_polarity 0.58 -0.08
## min_negative_polarity 0.08 -0.06
## max_negative_polarity 1.00 -0.02
## title_subjectivity -0.02 1.00
## title_sentiment_polarity 0.00 0.24
## abs_title_subjectivity 0.00 -0.49
## abs_title_sentiment_polarity -0.02 0.71
## shares -0.02 0.01
## title_sentiment_polarity abs_title_subjectivity
## n_tokens_title 0.00 -0.15
## n_tokens_content 0.02 0.01
## n_unique_tokens 0.00 -0.01
## n_non_stop_words 0.00 -0.01
## n_non_stop_unique_tokens 0.00 -0.01
## num_hrefs 0.04 0.01
## num_self_hrefs 0.03 0.01
## num_imgs 0.05 -0.01
## num_videos 0.02 -0.03
## average_token_length -0.01 0.02
## num_keywords 0.03 0.00
## kw_min_min 0.03 0.01
## kw_max_min 0.01 0.00
## kw_avg_min 0.01 0.01
## kw_min_max 0.01 -0.02
## kw_max_max -0.03 -0.01
## kw_avg_max 0.00 -0.02
## kw_min_avg 0.03 -0.02
## kw_max_avg 0.02 -0.01
## kw_avg_avg 0.04 -0.02
## self_reference_min_shares 0.01 0.00
## self_reference_max_shares 0.00 0.00
## self_reference_avg_sharess 0.00 0.00
## LDA_00 0.03 0.00
## LDA_01 -0.01 -0.07
## LDA_02 -0.07 0.05
## LDA_03 0.03 -0.01
## LDA_04 0.02 0.01
## global_subjectivity 0.05 -0.01
## global_sentiment_polarity 0.24 -0.02
## global_rate_positive_words 0.14 -0.13
## global_rate_negative_words -0.13 -0.06
## rate_positive_words 0.15 -0.02
## rate_negative_words -0.18 0.03
## avg_positive_polarity 0.09 0.01
## min_positive_polarity 0.00 0.02
## max_positive_polarity 0.08 -0.02
## avg_negative_polarity 0.08 -0.01
## min_negative_polarity 0.08 -0.01
## max_negative_polarity 0.00 0.00
## title_subjectivity 0.24 -0.49
## title_sentiment_polarity 1.00 -0.24
## abs_title_subjectivity -0.24 1.00
## abs_title_sentiment_polarity 0.42 -0.40
## shares 0.01 0.01
## abs_title_sentiment_polarity shares
## n_tokens_title 0.04 0.01
## n_tokens_content 0.02 0.01
## n_unique_tokens 0.00 0.00
## n_non_stop_words 0.00 0.00
## n_non_stop_unique_tokens -0.01 0.00
## num_hrefs 0.06 0.04
## num_self_hrefs 0.00 0.00
## num_imgs 0.07 0.03
## num_videos 0.06 0.02
## average_token_length -0.03 -0.03
## num_keywords 0.02 0.02
## kw_min_min 0.01 0.00
## kw_max_min 0.01 0.04
## kw_avg_min 0.01 0.04
## kw_min_max 0.03 0.01
## kw_max_max -0.01 0.00
## kw_avg_max 0.03 0.04
## kw_min_avg 0.06 0.03
## kw_max_avg 0.05 0.07
## kw_avg_avg 0.09 0.11
## self_reference_min_shares 0.01 0.07
## self_reference_max_shares 0.01 0.06
## self_reference_avg_sharess 0.01 0.07
## LDA_00 -0.03 0.00
## LDA_01 0.04 -0.01
## LDA_02 -0.08 -0.06
## LDA_03 0.11 0.08
## LDA_04 -0.04 -0.01
## global_subjectivity 0.09 0.03
## global_sentiment_polarity 0.07 0.00
## global_rate_positive_words 0.10 0.00
## global_rate_negative_words 0.06 0.00
## rate_positive_words 0.00 -0.01
## rate_negative_words -0.02 -0.01
## avg_positive_polarity 0.10 0.01
## min_positive_polarity 0.02 0.00
## max_positive_polarity 0.08 0.01
## avg_negative_polarity -0.08 -0.03
## min_negative_polarity -0.06 -0.02
## max_negative_polarity -0.02 -0.02
## title_subjectivity 0.71 0.01
## title_sentiment_polarity 0.42 0.01
## abs_title_subjectivity -0.40 0.01
## abs_title_sentiment_polarity 1.00 0.02
## shares 0.02 1.00
It is a long matrix and not easy to read. We could visualize it with the corrplot() function, but instead we will use an alternative from the caret package: findCorrelation(x = correlation_matrix, cutoff = 0.9), which identifies correlations above the accepted threshold and indicates variables to be deleted. By default it returns a list of column numbers. Let's decrease the cutoff to 0.75. Above that cutoff there may be pairs that are equally correlated, but the function returns only one variable of each pair: it compares both against the remaining predictors and suggests removing the one whose average correlation is higher.
findCorrelation(shares_correlation,
cutoff = 0.75,
names = TRUE) -> shares_crr
shares_crr
## [1] "kw_avg_avg" "rate_negative_words"
## [3] "kw_min_min" "self_reference_max_shares"
## [5] "kw_max_min" "self_reference_min_shares"
## [7] "n_non_stop_words" "n_non_stop_unique_tokens"
#list names of potential variables to be removed
As the target variable is quantitative and the explanatory variables qualitative, we can use analysis of variance (ANOVA). For example, we want to check whether our outcome variable shares differs across the two levels (Yes and No) of the predictor weekday_is_friday. The F statistic, or its p-value, can be used to verify this hypothesis. The hypotheses are: H0: weekday_is_friday does not impact shares, i.e. average shares do not differ across its levels. Ha: weekday_is_friday does impact shares. The higher the F-statistic (or the lower its p-value), the more strongly we reject H0. We will write a function that extracts only the p-value for each of our categorical variables, then sort the results in decreasing order of p-value.
shares_F_anova <- function(shares_categorical){
  anova_ <- aov(train_data$shares ~
                  train_data[[shares_categorical]])
  return(summary(anova_)[[1]][1, 5])
}
sapply(shares_categorical,
       shares_F_anova) %>%
  # in addition let's sort them
  # in decreasing order of p-value
  # and store as an object
  sort(decreasing = TRUE) -> shares_anova_all_categorical
shares_anova_all_categorical
## weekday_is_friday data_channel_is_socmed
## 7.452427e-01 3.538015e-01
## weekday_is_wednesday weekday_is_thursday
## 2.979947e-01 2.420819e-01
## data_channel_is_lifestyle weekday_is_sunday
## 1.738559e-01 1.588156e-01
## weekday_is_tuesday weekday_is_monday
## 1.303174e-01 1.099716e-01
## data_channel_is_tech data_channel_is_bus
## 7.059908e-02 3.349409e-02
## weekday_is_saturday is_weekend
## 1.244472e-02 4.440307e-03
## data_channel_is_entertainment data_channel_is_world
## 3.747287e-03 9.025154e-16
From the above p-values we can see that only data_channel_is_bus, weekday_is_saturday, is_weekend, data_channel_is_entertainment and data_channel_is_world have p-values below 0.05. For the remaining predictors, the p-values are greater than 0.05, so we fail to reject the null hypothesis that they do not impact shares. We will therefore include only the predictors with p-values below 0.05 in our analysis. As an illustration, below is a boxplot of log(shares) for one of the significant predictors, data_channel_is_entertainment.
ggplot(train_data,
aes(x = data_channel_is_entertainment ,
y = log(shares))) +
geom_boxplot(fill = "blue") +
theme_bw()+
labs(title = "Boxplot of log(shares) vs. data_channel_is_entertainment")+
theme(plot.title = element_text(hjust = 0.5))
Identification of linear relationships in the data: findLinearCombos(data) is a very useful function which decomposes the data matrix to identify groups of variables that are linear combinations of other variables. We treat such variables as redundant: since they can be determined from the others, they bring nothing to the analysis. Let's check the numeric variables.
(findLinearCombos(train_data[, shares_numerical]) ->
   shares_linearCombos)
## $linearCombos
## list()
##
## $remove
## NULL
It returns NULL for $remove, which means there are no linear combinations among the numeric variables in our training data.
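To see what the output looks like when a linear dependency does exist, consider a small artificial example (not part of our data):
# toy data: column c is the exact sum of a and b
demo <- data.frame(a = 1:5, b = c(2, 4, 1, 3, 5))
demo$c <- demo$a + demo$b
findLinearCombos(demo)
# $linearCombos now lists the dependent group and $remove suggests dropping column 3 (c)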
We also want to avoid features whose values are heavily concentrated in a single value, i.e. features with (near) zero variance. To identify such variables we can use the function nearZeroVar(data, saveMetrics = FALSE) from the caret package.
nearZeroVar(train_data,
            saveMetrics = TRUE) -> shares_nzv_stats
shares_nzv_stats %>%
  # we add rownames of the frame
  # (with names of variables)
  # as a new column in the data
  rownames_to_column("variable") %>%
  # and sort it in decreasing order
  arrange(-zeroVar, -nzv, -freqRatio)
## variable freqRatio percentUnique zeroVar nzv
## 1 kw_min_max 44.906015 3.390746613 FALSE TRUE
## 2 n_non_stop_words 33.294190 0.014413376 FALSE TRUE
## 3 kw_min_avg 114.807692 43.110406457 FALSE FALSE
## 4 global_sentiment_polarity 37.409091 89.031421159 FALSE FALSE
## 5 global_subjectivity 33.833333 88.732343615 FALSE FALSE
## 6 global_rate_negative_words 30.368421 30.300518882 FALSE FALSE
## 7 data_channel_is_lifestyle 17.713419 0.007206688 FALSE FALSE
## 8 data_channel_is_socmed 16.205208 0.007206688 FALSE FALSE
## 9 weekday_is_saturday 14.976972 0.007206688 FALSE FALSE
## 10 average_token_length 13.483333 80.210435284 FALSE FALSE
## 11 weekday_is_sunday 13.261048 0.007206688 FALSE FALSE
## 12 self_reference_avg_sharess 12.080488 23.807293168 FALSE FALSE
## 13 n_tokens_content 11.724638 7.909339867 FALSE FALSE
## 14 kw_avg_avg 9.833333 99.243297780 FALSE FALSE
## 15 global_rate_positive_words 9.511364 38.173825310 FALSE FALSE
## 16 title_sentiment_polarity 9.460285 2.489910637 FALSE FALSE
## 17 self_reference_max_shares 9.089908 3.823147881 FALSE FALSE
## 18 kw_avg_min 7.934426 49.409051600 FALSE FALSE
## 19 abs_title_subjectivity 7.821311 1.650331508 FALSE FALSE
## 20 n_unique_tokens 7.632075 61.833381378 FALSE FALSE
## 21 kw_max_max 7.598422 0.093686941 FALSE FALSE
## 22 abs_title_sentiment_polarity 7.565147 2.014269242 FALSE FALSE
## 23 title_subjectivity 6.891197 2.082732776 FALSE FALSE
## 24 is_weekend 6.535162 0.007206688 FALSE FALSE
## 25 weekday_is_friday 6.043655 0.007206688 FALSE FALSE
## 26 avg_positive_polarity 5.978571 71.908330931 FALSE FALSE
## 27 n_non_stop_unique_tokens 5.905109 47.171375036 FALSE FALSE
## 28 data_channel_is_bus 5.249043 0.007206688 FALSE FALSE
## 29 weekday_is_monday 4.965606 0.007206688 FALSE FALSE
## 30 avg_negative_polarity 4.835196 38.105361776 FALSE FALSE
## 31 data_channel_is_entertainment 4.616677 0.007206688 FALSE FALSE
## 32 weekday_is_thursday 4.473767 0.007206688 FALSE FALSE
## 33 data_channel_is_tech 4.451188 0.007206688 FALSE FALSE
## 34 weekday_is_wednesday 4.337950 0.007206688 FALSE FALSE
## 35 weekday_is_tuesday 4.328725 0.007206688 FALSE FALSE
## 36 self_reference_min_shares 4.076543 4.284375901 FALSE FALSE
## 37 data_channel_is_world 3.683091 0.007206688 FALSE FALSE
## 38 num_videos 2.654242 0.187373883 FALSE FALSE
## 39 num_imgs 2.600328 0.320697607 FALSE FALSE
## 40 LDA_04 2.400000 72.427212453 FALSE FALSE
## 41 max_positive_polarity 2.241922 0.136927068 FALSE FALSE
## 42 min_positive_polarity 2.067956 0.115307005 FALSE FALSE
## 43 kw_min_min 1.928623 0.064860190 FALSE FALSE
## 44 kw_max_avg 1.735294 54.435716345 FALSE FALSE
## 45 rate_negative_words 1.492241 7.087777457 FALSE FALSE
## 46 LDA_00 1.341463 75.583741712 FALSE FALSE
## 47 kw_avg_max 1.340909 82.131017584 FALSE FALSE
## 48 kw_max_min 1.300000 3.761891035 FALSE FALSE
## 49 rate_positive_words 1.262242 7.087777457 FALSE FALSE
## 50 min_negative_polarity 1.258532 0.190977227 FALSE FALSE
## 51 LDA_02 1.147059 75.972902854 FALSE FALSE
## 52 shares 1.144587 4.788844047 FALSE FALSE
## 53 max_negative_polarity 1.131511 0.176563851 FALSE FALSE
## 54 num_self_hrefs 1.119714 0.201787259 FALSE FALSE
## 55 LDA_03 1.115385 76.585471317 FALSE FALSE
## 56 num_keywords 1.088715 0.036033439 FALSE FALSE
## 57 n_tokens_title 1.044948 0.072066878 FALSE FALSE
## 58 num_hrefs 1.022993 0.450417988 FALSE FALSE
## 59 LDA_01 1.000000 77.673681176 FALSE FALSE
Only two variables have near-zero variance: kw_min_max and n_non_stop_words.
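For reference, caret flags a variable as near-zero-variance when its freqRatio exceeds freqCut (default 95/5 = 19) and its percentUnique falls below uniqueCut (default 10). We can reproduce the flags manually from the metrics table (an illustrative check):
shares_nzv_stats %>%
  rownames_to_column("variable") %>%
  filter(freqRatio > 95/5, percentUnique < 10) %>%
  pull(variable)
# expected: "n_non_stop_words" "kw_min_max"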
Let's now build the list of selected_variables by dropping the near-zero-variance features, the mutually correlated predictors identified earlier, and the variables that are only weakly correlated with the outcome.
# start with all
selected_variables <- names(train_data)
# take the 25 numeric variables most strongly correlated with the outcome
# and 5 categorical variables selected by their ANOVA p-values
selected_variables <- c(correlation_order[1:25],
names(shares_anova_all_categorical)[10:14]
)
# now drop mutually correlated variables as identified by findCorrelation()
selected_variables <-
selected_variables [!selected_variables %in%
shares_crr]
(shares_variables_nzv <- nearZeroVar(train_data,
names = TRUE))
## [1] "n_non_stop_words" "kw_min_max"
selected_variables <-
selected_variables [!selected_variables %in%
shares_variables_nzv]
selected_variables
## [1] "shares" "LDA_03"
## [3] "self_reference_avg_sharess" "kw_max_avg"
## [5] "num_hrefs" "kw_avg_max"
## [7] "kw_avg_min" "kw_min_avg"
## [9] "num_imgs" "global_subjectivity"
## [11] "num_videos" "num_keywords"
## [13] "abs_title_sentiment_polarity" "title_subjectivity"
## [15] "n_tokens_title" "title_sentiment_polarity"
## [17] "avg_positive_polarity" "max_positive_polarity"
## [19] "n_tokens_content" "abs_title_subjectivity"
## [21] "data_channel_is_bus" "weekday_is_saturday"
## [23] "is_weekend" "data_channel_is_entertainment"
## [25] "data_channel_is_world"
shares_lm1 <- lm(log(shares) ~ .,
data = train_data %>%
dplyr::select(all_of(selected_variables)))
vifs <- ols_vif_tol(shares_lm1)
vifs[which(vifs$Variables %in% shares_numerical),]
## Variables Tolerance VIF
## 1 LDA_03 0.6078630 1.645107
## 2 self_reference_avg_sharess 0.9449376 1.058271
## 3 kw_max_avg 0.6437783 1.553330
## 4 num_hrefs 0.7221733 1.384709
## 5 kw_avg_max 0.5949289 1.680873
## 6 kw_avg_min 0.7028196 1.422840
## 7 kw_min_avg 0.7731303 1.293443
## 8 num_imgs 0.7274738 1.374620
## 9 global_subjectivity 0.5662949 1.765864
## 10 num_videos 0.8453049 1.183005
## 11 num_keywords 0.7772105 1.286653
## 12 abs_title_sentiment_polarity 0.4203891 2.378749
## 13 title_subjectivity 0.4235306 2.361104
## 14 n_tokens_title 0.9275080 1.078158
## 15 title_sentiment_polarity 0.7973968 1.254081
## 16 avg_positive_polarity 0.3823022 2.615731
## 17 max_positive_polarity 0.3893029 2.568694
## 18 n_tokens_content 0.5896351 1.695964
## 19 abs_title_subjectivity 0.7196300 1.389603
The above output shows the variance inflation factor (VIF) for each predictor. All VIFs are well below the common threshold of 5, so multicollinearity among the retained predictors is limited; a manual check is sketched below.
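Each VIF can be reproduced by hand: VIF_j = 1 / (1 - R2_j), where R2_j is the R-squared from regressing predictor j on all the other predictors. A sketch for num_hrefs (illustrative, not part of the original pipeline):
aux_r2 <- summary(
  lm(num_hrefs ~ .,
     data = train_data %>%
       dplyr::select(all_of(setdiff(selected_variables, "shares"))))
)$r.squared
1 / (1 - aux_r2)   # should be close to the VIF of ~1.38 reported above
Now let's apply the backward elimination method to identify and remove redundant features.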
shares_lm1_backward_p <- ols_step_backward_p(shares_lm1,
prem = 0.05,
progress = F)
summary(shares_lm1_backward_p$model)
##
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")),
## data = l)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.1734 -0.5658 -0.1768 0.4129 5.8924
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.058e+00 4.620e-02 152.789 < 2e-16 ***
## LDA_03 1.500e-01 2.129e-02 7.045 1.90e-12 ***
## self_reference_avg_sharess 2.579e-06 2.396e-07 10.762 < 2e-16 ***
## kw_max_avg 9.629e-06 9.816e-07 9.809 < 2e-16 ***
## num_hrefs 4.359e-03 5.459e-04 7.984 1.47e-15 ***
## kw_min_avg 5.069e-05 5.031e-06 10.075 < 2e-16 ***
## num_imgs 2.526e-03 7.256e-04 3.481 0.000500 ***
## global_subjectivity 3.610e-01 6.029e-02 5.987 2.16e-09 ***
## num_keywords 1.835e-02 3.030e-03 6.055 1.42e-09 ***
## title_subjectivity 8.892e-02 1.937e-02 4.591 4.43e-06 ***
## n_tokens_title 6.041e-03 2.583e-03 2.338 0.019372 *
## title_sentiment_polarity 8.104e-02 2.095e-02 3.869 0.000109 ***
## avg_positive_polarity -3.490e-01 6.617e-02 -5.275 1.34e-07 ***
## n_tokens_content 3.050e-05 1.338e-05 2.280 0.022633 *
## abs_title_subjectivity 1.437e-01 3.318e-02 4.332 1.48e-05 ***
## data_channel_is_busYes -1.568e-01 1.661e-02 -9.437 < 2e-16 ***
## is_weekendYes 2.680e-01 1.578e-02 16.985 < 2e-16 ***
## data_channel_is_entertainmentYes -3.538e-01 1.540e-02 -22.978 < 2e-16 ***
## data_channel_is_worldYes -3.768e-01 1.520e-02 -24.790 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8843 on 27733 degrees of freedom
## Multiple R-squared: 0.09147, Adjusted R-squared: 0.09088
## F-statistic: 155.1 on 18 and 27733 DF, p-value: < 2.2e-16
shares_lm1_backward_p$removed
## [1] "weekday_is_saturday" "max_positive_polarity"
## [3] "kw_avg_min" "num_videos"
## [5] "abs_title_sentiment_polarity" "kw_avg_max"
We end up with a list of variables that the backward elimination method suggests removing. Let's drop these features and store the remaining ones in selected_final_variables for the regression models.
selected_final_variables <- selected_variables[-which(selected_variables %in%
c("weekday_is_saturday", "max_positive_polarity", "kw_avg_min", "num_videos","abs_title_sentiment_polarity", "kw_avg_max"))]
shares_lm2 <- lm(log(shares) ~ .,
data = train_data %>%
dplyr::select(all_of(selected_final_variables)))
summary(shares_lm2)
##
## Call:
## lm(formula = log(shares) ~ ., data = train_data %>% dplyr::select(all_of(selected_final_variables)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.1734 -0.5658 -0.1768 0.4129 5.8924
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.058e+00 4.620e-02 152.789 < 2e-16 ***
## LDA_03 1.500e-01 2.129e-02 7.045 1.90e-12 ***
## self_reference_avg_sharess 2.579e-06 2.396e-07 10.762 < 2e-16 ***
## kw_max_avg 9.629e-06 9.816e-07 9.809 < 2e-16 ***
## num_hrefs 4.359e-03 5.459e-04 7.984 1.47e-15 ***
## kw_min_avg 5.069e-05 5.031e-06 10.075 < 2e-16 ***
## num_imgs 2.526e-03 7.256e-04 3.481 0.000500 ***
## global_subjectivity 3.610e-01 6.029e-02 5.987 2.16e-09 ***
## num_keywords 1.835e-02 3.030e-03 6.055 1.42e-09 ***
## title_subjectivity 8.892e-02 1.937e-02 4.591 4.43e-06 ***
## n_tokens_title 6.041e-03 2.583e-03 2.338 0.019372 *
## title_sentiment_polarity 8.104e-02 2.095e-02 3.869 0.000109 ***
## avg_positive_polarity -3.490e-01 6.617e-02 -5.275 1.34e-07 ***
## n_tokens_content 3.050e-05 1.338e-05 2.280 0.022633 *
## abs_title_subjectivity 1.437e-01 3.318e-02 4.332 1.48e-05 ***
## data_channel_is_busYes -1.568e-01 1.661e-02 -9.437 < 2e-16 ***
## is_weekendYes 2.680e-01 1.578e-02 16.985 < 2e-16 ***
## data_channel_is_entertainmentYes -3.538e-01 1.540e-02 -22.978 < 2e-16 ***
## data_channel_is_worldYes -3.768e-01 1.520e-02 -24.790 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8843 on 27733 degrees of freedom
## Multiple R-squared: 0.09147, Adjusted R-squared: 0.09088
## F-statistic: 155.1 on 18 and 27733 DF, p-value: < 2.2e-16
The above output shows the regression estimates. All of our selected_final_variables bear a statistically significant relationship with the outcome variable.
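Because the model is fit on log(shares), the coefficients have a multiplicative interpretation on the original scale; a back-of-the-envelope reading of two estimates from the table above:
exp(0.268) - 1    # ~ 0.31: weekend articles get about 31% more shares,
                  # other predictors held constant
exp(-0.3768) - 1  # ~ -0.31: world-channel articles get about 31% fewer shares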
shares_predicted <- predict(shares_lm2)
ggplot(data.frame(error = log(train_data$shares) - shares_predicted),
aes(x = error)) +
geom_histogram(fill = "blue", bins = 100) +
theme_bw() +
labs(title = "Distribution of errors")+
theme(plot.title = element_text(hjust = 0.5))
The above histogram presents the distribution of errors; it is slightly right-skewed.
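A quick numeric confirmation of the skew (illustrative):
errors <- log(train_data$shares) - shares_predicted
c(mean = mean(errors), median = median(errors))
# the mean (~0) exceeding the median (~ -0.18, cf. the residual summary
# above) is consistent with a right-skewed error distribution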
# applying k-nearest neighbours (KNN) regression
different_k <- data.frame(k = seq(1, 99, 4))
trControl10 <- trainControl(method = 'cv',
number = 5)
set.seed(987654321)
shares_train_knn <-
train(log(shares) ~ .,
data = train_data%>%
dplyr::select(all_of(selected_final_variables)),
# model type - now knn!!
method = "knn",
# train control
trControl = trControl10,
# we give the parameter(s)
# required by the model
tuneGrid = different_k)
shares_train_knn
## k-Nearest Neighbors
##
## 27752 samples
## 18 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 22201, 22202, 22200, 22202, 22203
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 1 1.2245488 0.01618454 0.9136462
## 5 0.9510744 0.04707808 0.7153875
## 9 0.9202472 0.05854906 0.6912398
## 13 0.9063531 0.06809598 0.6797068
## 17 0.8993892 0.07407287 0.6744863
## 21 0.8952353 0.07795513 0.6708487
## 25 0.8935680 0.07919309 0.6689866
## 29 0.8916505 0.08129036 0.6672542
## 33 0.8898324 0.08372251 0.6656412
## 37 0.8890289 0.08456773 0.6648607
## 41 0.8883209 0.08543991 0.6644468
## 45 0.8874559 0.08667781 0.6638042
## 49 0.8869799 0.08731341 0.6633693
## 53 0.8861999 0.08854308 0.6627953
## 57 0.8859525 0.08883439 0.6626536
## 61 0.8858113 0.08895102 0.6627494
## 65 0.8855681 0.08930195 0.6623729
## 69 0.8853146 0.08967885 0.6623255
## 73 0.8852310 0.08975513 0.6623045
## 77 0.8849628 0.09018560 0.6621098
## 81 0.8847215 0.09058359 0.6618570
## 85 0.8844259 0.09110720 0.6616450
## 89 0.8840423 0.09179092 0.6614116
## 93 0.8840582 0.09172586 0.6614094
## 97 0.8837640 0.09226801 0.6612176
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 97.
source("F_regression_metrics.R")
plot(shares_train_knn)
The above plot shows that the lowest cross-validated RMSE is obtained for k = 97, so the final model is a 97-nearest-neighbours regression. Since k = 97 is the largest value in our grid, the optimum may lie beyond it (see the sketch below). The training and test metrics that follow show the model performs somewhat worse on the test set, which points to mild overfitting.
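Because the chosen k sits at the upper edge of the tuning grid, the cross-validated RMSE may still have been decreasing; one could extend the search as follows (an optional sketch, results not shown here):
wider_k <- data.frame(k = seq(97, 201, 8))
set.seed(987654321)
shares_train_knn2 <- train(log(shares) ~ .,
                           data = train_data %>%
                             dplyr::select(all_of(selected_final_variables)),
                           method = "knn",
                           trControl = trControl10,
                           tuneGrid = wider_k)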
regressionMetrics(real = log(train_data$shares),
predicted = predict(shares_train_knn,
train_data[,selected_final_variables]))
## MSE RMSE MAE MedAE MSLE R2
## 1 0.7647683 0.8745103 0.6544728 0.5107568 0.009982268 0.1108046
regressionMetrics(real = log(test_data$shares),
predicted = predict(shares_train_knn,
test_data[,selected_final_variables]))
## MSE RMSE MAE MedAE MSLE R2
## 1 0.8039636 0.8966402 0.6680053 0.5158331 0.01083345 0.08548314
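The regressionMetrics() helper sourced from F_regression_metrics.R is not listed in this report; a minimal implementation consistent with the columns it returns might look like this (an assumption, not the original file):
regressionMetrics <- function(real, predicted) {
  e <- real - predicted
  data.frame(MSE   = mean(e^2),
             RMSE  = sqrt(mean(e^2)),
             MAE   = mean(abs(e)),
             MedAE = median(abs(e)),
             MSLE  = mean((log(1 + real) - log(1 + predicted))^2),
             R2    = 1 - sum(e^2) / sum((real - mean(real))^2))
}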
# let's now apply ridge regression
shares_ridge <- glmnet(# x needs to be a matrix here
x = as.matrix(train_data[, shares_numerical]),
y = train_data$shares,
family = "gaussian", # if not provided, the function guesses it
# based on the distribution of the variable y
alpha = 0) #for ridge regression
plot(shares_ridge)
As lambda increases, the ridge penalty shrinks the regression coefficients towards zero, as the quick check below illustrates.
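We can extract coefficients at the two ends of glmnet's fitted lambda path (an illustrative sketch):
cbind(coef(shares_ridge, s = min(shares_ridge$lambda)),
      coef(shares_ridge, s = max(shares_ridge$lambda)))
# at the largest lambda the slope coefficients are shrunk close to zero
Let's now tune lambda over a grid of values with caret, keeping alpha fixed at zero (which corresponds to ridge regression).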
ctrl_cv5 <- trainControl(method = "cv", number = 5)
lambdas <- exp(log(10)*seq(-2, 9, length.out = 200))
parameters_ridge <- expand.grid(alpha = 0, lambda = lambdas)
set.seed(123456789)
shares_ridge <- train(log(shares) ~ .,
data = train_data %>%
dplyr::select(all_of(selected_final_variables)),
method = "glmnet",
tuneGrid = parameters_ridge,
trControl = ctrl_cv5)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
shares_ridge
## glmnet
##
## 27752 samples
## 18 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 22202, 22202, 22199, 22202, 22203
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 1.000000e-02 0.8850103 0.08928777 0.6605707
## 1.135733e-02 0.8850103 0.08928777 0.6605707
## 1.289890e-02 0.8850103 0.08928777 0.6605707
## 1.464971e-02 0.8850101 0.08928795 0.6605766
## 1.663817e-02 0.8850110 0.08928678 0.6606038
## 1.889652e-02 0.8850125 0.08928552 0.6606356
## 2.146141e-02 0.8850150 0.08928327 0.6606720
## 2.437444e-02 0.8850187 0.08928019 0.6607134
## 2.768287e-02 0.8850240 0.08927606 0.6607607
## 3.144035e-02 0.8850315 0.08927058 0.6608154
## 3.570786e-02 0.8850416 0.08926335 0.6608780
## 4.055461e-02 0.8850552 0.08925387 0.6609494
## 4.605922e-02 0.8850731 0.08924204 0.6610320
## 5.231099e-02 0.8850964 0.08922686 0.6611276
## 5.941134e-02 0.8851269 0.08920725 0.6612377
## 6.747544e-02 0.8851659 0.08918237 0.6613642
## 7.663411e-02 0.8852157 0.08915105 0.6615115
## 8.703591e-02 0.8852783 0.08911220 0.6616801
## 9.884959e-02 0.8853569 0.08906390 0.6618717
## 1.122668e-01 0.8854545 0.08900461 0.6620892
## 1.275051e-01 0.8855745 0.08893250 0.6623355
## 1.448118e-01 0.8857216 0.08884477 0.6626184
## 1.644676e-01 0.8858996 0.08874006 0.6629417
## 1.867914e-01 0.8861139 0.08861533 0.6633081
## 2.121452e-01 0.8863703 0.08846733 0.6637263
## 2.409404e-01 0.8866726 0.08829603 0.6641942
## 2.736440e-01 0.8870284 0.08809650 0.6647164
## 3.107866e-01 0.8874432 0.08786708 0.6652971
## 3.529707e-01 0.8879204 0.08760835 0.6659375
## 4.008806e-01 0.8884687 0.08731490 0.6666456
## 4.552935e-01 0.8890905 0.08698900 0.6674185
## 5.170920e-01 0.8897894 0.08663027 0.6682622
## 5.872787e-01 0.8905724 0.08623455 0.6691752
## 6.669920e-01 0.8914345 0.08581134 0.6701485
## 7.575250e-01 0.8923821 0.08535523 0.6711937
## 8.603464e-01 0.8934158 0.08486732 0.6723118
## 9.771242e-01 0.8945222 0.08436181 0.6734693
## 1.109752e+00 0.8957098 0.08382923 0.6746762
## 1.260383e+00 0.8969680 0.08327891 0.6759269
## 1.431459e+00 0.8982820 0.08272141 0.6771992
## 1.625756e+00 0.8996559 0.08214881 0.6785054
## 1.846425e+00 0.9010663 0.08157907 0.6798339
## 2.097046e+00 0.9025075 0.08101101 0.6811771
## 2.381686e+00 0.9039740 0.08044303 0.6825214
## 2.704960e+00 0.9054322 0.07989846 0.6838359
## 3.072113e+00 0.9068907 0.07936305 0.6851355
## 3.489101e+00 0.9083336 0.07884377 0.6864085
## 3.962689e+00 0.9097336 0.07835618 0.6876382
## 4.500558e+00 0.9111020 0.07788557 0.6888406
## 5.111433e+00 0.9124179 0.07744354 0.6900009
## 5.805226e+00 0.9136715 0.07703249 0.6911043
## 6.593188e+00 0.9148700 0.07664305 0.6921523
## 7.488104e+00 0.9159908 0.07628931 0.6931237
## 8.504489e+00 0.9170453 0.07596049 0.6940336
## 9.658832e+00 0.9180335 0.07565468 0.6948806
## 1.096986e+01 0.9189382 0.07538284 0.6956542
## 1.245883e+01 0.9197796 0.07513079 0.6963742
## 1.414991e+01 0.9205529 0.07490154 0.6970374
## 1.607053e+01 0.9212540 0.07469795 0.6976383
## 1.825183e+01 0.9218980 0.07451078 0.6981902
## 2.072922e+01 0.9224790 0.07434457 0.6986872
## 2.354286e+01 0.9230046 0.07419554 0.6991350
## 2.673842e+01 0.9234819 0.07405985 0.6995408
## 3.036771e+01 0.9239059 0.07394194 0.6999020
## 3.448962e+01 0.9242891 0.07383511 0.7002288
## 3.917101e+01 0.9246337 0.07373917 0.7005228
## 4.448783e+01 0.9249382 0.07365582 0.7007827
## 5.052631e+01 0.9252124 0.07358023 0.7010170
## 5.738442e+01 0.9254563 0.07351337 0.7012256
## 6.517340e+01 0.9256722 0.07345477 0.7014102
## 7.401960e+01 0.9258657 0.07340181 0.7015756
## 8.406653e+01 0.9260359 0.07335591 0.7017208
## 9.547716e+01 0.9261874 0.07331497 0.7018497
## 1.084366e+02 0.9263225 0.07327820 0.7019645
## 1.231551e+02 0.9264405 0.07324668 0.7020645
## 1.398713e+02 0.9271491 0.07209037 0.7026629
## 1.588565e+02 0.9273364 NaN 0.7028197
## ... (122 rows omitted: every lambda from 1.804186e+02 to 8.804884e+08
## gives identical results: RMSE 0.9273364, Rsquared NaN, MAE 0.7028197)
## 1.000000e+09 0.9273364 NaN 0.7028197
##
## Tuning parameter 'alpha' was held constant at a value of 0
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0 and lambda = 0.01464971.
plot(shares_ridge)
In the above plot we can see that the RMSE is lowest for small values of lambda. Let's therefore search over a finer grid of small lambda values and estimate the best model.
parameters_ridge <- expand.grid(alpha = 0,
lambda = seq(0.0001, 0.2, 0.0001))
set.seed(123456789)
shares_ridge2 <- train(log(shares) ~ .,
data = train_data %>%
dplyr::select(all_of(selected_final_variables)),
method = "glmnet",
tuneGrid = parameters_ridge,
trControl = ctrl_cv5)
#shares_ridge2
plot(shares_ridge2)
The above plot demonstrates that the RMSE stays essentially flat for lambda up to about 0.01 and starts to increase beyond that; the minimum is reached just below 0.015.
shares_ridge2$bestTune$lambda
## [1] 0.0143
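The ridge coefficients at the selected penalty can be inspected directly from the underlying glmnet object (a sketch):
coef(shares_ridge2$finalModel, s = shares_ridge2$bestTune$lambda)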
To compare the estimated models (OLS, KNN, and ridge regression), let's compute the fitted values for each model and summarize the error measures.
shares_model_list <- list(shares_lm2 =shares_lm2 ,
shares_train_knn_=shares_train_knn,
shares_ridge2=shares_ridge2)
shares_fitted <- shares_model_list %>%
sapply(function(x) predict(x, newdata=train_data[, selected_final_variables])) %>%
data.frame()
sapply(shares_fitted,
function(x) regressionMetrics(real = log(train_data$shares), predicted = x
)) %>% t()
## MSE RMSE MAE MedAE MSLE
## shares_lm2 0.7814003 0.8839685 0.6595661 0.5170039 0.01017652
## shares_train_knn_ 0.7647683 0.8745103 0.6544728 0.5107568 0.009982268
## shares_ridge2 0.7814237 0.8839817 0.6597824 0.5176545 0.01017712
## R2
## shares_lm2 0.0914667
## shares_train_knn_ 0.1108046
## shares_ridge2 0.09143946
On the training data, OLS gives RMSE = 0.884 and R2 = 0.091; KNN gives 0.875 and 0.111; and ridge regression gives 0.884 and 0.091. KNN therefore performs best on the training sample, with the lowest RMSE and the highest R2.
Let's now compute predictions for the test data and see how the results differ.
shares_forecasts <- shares_model_list %>%
sapply(function(x) predict(x, newdata = test_data[,selected_final_variables])) %>%
data.frame()
sapply(shares_forecasts,
function(x) regressionMetrics(log(test_data$shares), x
)) %>% t()
## MSE RMSE MAE MedAE MSLE R2
## shares_lm2 0.803079 0.8961467 0.6632185 0.5123149 0.01079164 0.08648938
## shares_train_knn_ 0.8039636 0.8966402 0.6680053 0.5158331 0.01083345 0.08548314
## shares_ridge2 0.8030152 0.8961112 0.6633658 0.5132054 0.01079122 0.08656191
On the test data, ridge regression performs best among all the models: it has the lowest RMSE (0.8961) and the highest R2 (0.0866). We therefore select ridge regression as our final model for predicting shares on unseen data.
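Note that all models were trained on log(shares); to report predictions on the original scale, back-transform with exp() (a sketch; exp() of a log-scale prediction corresponds to the conditional median rather than the mean):
predicted_shares <- exp(predict(shares_ridge2,
                                newdata = test_data[, selected_final_variables]))
head(round(predicted_shares))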