The linear regression model studies the relationship between a dependent variable (\(y\)) and a set of one or more independent variables (\(x_1,\dots,x_k\)). Mathematically, the population regression function (PRF) can be expressed as: \[y=f(x_1,x_2,\dots,x_k)+u=\beta_1 x_1+\beta_2 x_2+\dots+\beta_k x_k+u\] The idea behind the regression is to estimate the population parameters \(\boldsymbol{\beta}\) from a sample. Unlike its population counterpart, the sample regression function (SRF) can be expressed in matrix notation as: \[\boldsymbol{\hat{y}=X\hat{\beta}}\] To obtain the coefficients we minimize the sum of the squared residuals, applying the first-order conditions (FOC) to: \[S=\sum_i \hat{u}^2_i=\boldsymbol{(y-X\hat{\beta})'(y-X\hat{\beta})}\] The moment conditions result in the normal equations: \[\boldsymbol{X'X\hat{\beta}=X'y}\to \boldsymbol{\hat{\beta}_{OLS}=(X'X)^{-1}X'y}\]
Under assumptions 1-4, the OLS estimator is unbiased
Under assumptions 1-6, the OLS estimator is BLUE (best linear unbiased estimator)
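Since the closed form above is all the OLS estimator needs, it can be checked numerically. The following sketch (on simulated data; all object names are illustrative and not part of the analysis below) computes \(\boldsymbol{(X'X)^{-1}X'y}\) directly and compares it with R's lm():
set.seed(1)
n=100
x1=rnorm(n); x2=rnorm(n)
y=1+2*x1-3*x2+rnorm(n)              # simulated PRF with known betas
X=cbind(1,x1,x2)                    # design matrix with an intercept column
beta_hat=solve(t(X)%*%X, t(X)%*%y)  # (X'X)^{-1} X'y via a linear solve
cbind(beta_hat, coef(lm(y~x1+x2)))  # the two columns should match
Both columns recover the population parameters (1, 2, -3) up to sampling error.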
This data set summarizes a heterogeneous set of features about articles published by Mashable over a period of two years. The goal is to predict the number of shares in social networks (popularity).
K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal.
# Download, unzip and read the data set
download.file("http://archive.ics.uci.edu/ml/machine-learning-databases/00332/OnlineNewsPopularity.zip",
              destfile="onp.zip", mode="wb")
files=unzip("onp.zip")  # extract the archive and keep the file paths
onp=read.csv(files[2])  # the second extracted file is the CSV
The data set has 39,644 observations and 61 variables, 58 of which are going to be used as predictors for the number of shares.
library(dplyr)   # for tbl_df() and glimpse()
onp=tbl_df(onp)  # wrap the data frame for nicer printing
glimpse(onp)
## Observations: 39,644
## Variables: 61
## $ url (fctr) http://mashable.com/2013/01/07/...
## $ timedelta (dbl) 731, 731, 731, 731, 731, 731, 73...
## $ n_tokens_title (dbl) 12, 9, 9, 9, 13, 10, 8, 12, 11, ...
## $ n_tokens_content (dbl) 219, 255, 211, 531, 1072, 370, 9...
## $ n_unique_tokens (dbl) 0.6635945, 0.6047431, 0.5751295,...
## $ n_non_stop_words (dbl) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ n_non_stop_unique_tokens (dbl) 0.8153846, 0.7919463, 0.6638655,...
## $ num_hrefs (dbl) 4, 3, 3, 9, 19, 2, 21, 20, 2, 4,...
## $ num_self_hrefs (dbl) 2, 1, 1, 0, 19, 2, 20, 20, 0, 1,...
## $ num_imgs (dbl) 1, 1, 1, 1, 20, 0, 20, 20, 0, 1,...
## $ num_videos (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ average_token_length (dbl) 4.680365, 4.913725, 4.393365, 4....
## $ num_keywords (dbl) 5, 4, 6, 7, 7, 9, 10, 9, 7, 5, 8...
## $ data_channel_is_lifestyle (dbl) 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ data_channel_is_entertainment (dbl) 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ data_channel_is_bus (dbl) 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ data_channel_is_socmed (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ data_channel_is_tech (dbl) 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0,...
## $ data_channel_is_world (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,...
## $ kw_min_min (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ kw_max_min (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ kw_avg_min (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ kw_min_max (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ kw_max_max (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ kw_avg_max (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ kw_min_avg (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ kw_max_avg (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ kw_avg_avg (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ self_reference_min_shares (dbl) 496, 0, 918, 0, 545, 8500, 545, ...
## $ self_reference_max_shares (dbl) 496, 0, 918, 0, 16000, 8500, 160...
## $ self_reference_avg_sharess (dbl) 496.000, 0.000, 918.000, 0.000, ...
## $ weekday_is_monday (dbl) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ weekday_is_tuesday (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ weekday_is_wednesday (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ weekday_is_thursday (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ weekday_is_friday (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ weekday_is_saturday (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ weekday_is_sunday (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ is_weekend (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ LDA_00 (dbl) 0.50033120, 0.79975569, 0.217792...
## $ LDA_01 (dbl) 0.37827893, 0.05004668, 0.033334...
## $ LDA_02 (dbl) 0.04000468, 0.05009625, 0.033351...
## $ LDA_03 (dbl) 0.04126265, 0.05010067, 0.033333...
## $ LDA_04 (dbl) 0.04012254, 0.05000071, 0.682188...
## $ global_subjectivity (dbl) 0.5216171, 0.3412458, 0.7022222,...
## $ global_sentiment_polarity (dbl) 0.09256198, 0.14894781, 0.323333...
## $ global_rate_positive_words (dbl) 0.04566210, 0.04313725, 0.056872...
## $ global_rate_negative_words (dbl) 0.013698630, 0.015686275, 0.0094...
## $ rate_positive_words (dbl) 0.7692308, 0.7333333, 0.8571429,...
## $ rate_negative_words (dbl) 0.2307692, 0.2666667, 0.1428571,...
## $ avg_positive_polarity (dbl) 0.3786364, 0.2869146, 0.4958333,...
## $ min_positive_polarity (dbl) 0.10000000, 0.03333333, 0.100000...
## $ max_positive_polarity (dbl) 0.7000000, 0.7000000, 1.0000000,...
## $ avg_negative_polarity (dbl) -0.3500000, -0.1187500, -0.46666...
## $ min_negative_polarity (dbl) -0.6000000, -0.1250000, -0.80000...
## $ max_negative_polarity (dbl) -0.2000000, -0.1000000, -0.13333...
## $ title_subjectivity (dbl) 0.5000000, 0.0000000, 0.0000000,...
## $ title_sentiment_polarity (dbl) -0.1875000, 0.0000000, 0.0000000...
## $ abs_title_subjectivity (dbl) 0.00000000, 0.50000000, 0.500000...
## $ abs_title_sentiment_polarity (dbl) 0.1875000, 0.0000000, 0.0000000,...
## $ shares (int) 593, 711, 1500, 1200, 505, 855, ...
It is useful to check for correlations among the predictors.
library(d3heatmap)  # interactive correlation heatmap
mcor=cor(select(onp,-c(url,timedelta,shares)))  # correlations among the 58 predictors
diag(mcor)=NA                                   # blank out the trivial self-correlations
d3heatmap(mcor, colors=c("red","white","green"),
          scale="col", dendrogram="both",
          xaxis_font_size="7pt", yaxis_font_size="7pt")
As we can see, the heatmap shows green and red segments, revealing positive and negative linear associations respectively. Linear regression can still be used for prediction when a certain degree of collinearity exists, as long as it is not perfect (Gauss-Markov assumption #3). However, collinearity results in a loss of our ability to interpret the coefficients. In order to mitigate this problem we can pre-process the data to remove pairwise correlated predictors.
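One way to carry out that pre-processing, sketched below, is findCorrelation() from the caret package (our choice for illustration; any pairwise-correlation filter would do). It flags predictors whose absolute pairwise correlation exceeds a cutoff:
library(caret)  # for findCorrelation()
# Recompute the correlation matrix without the NA diagonal;
# the 0.9 cutoff is illustrative, not a tuned value
mcor2=cor(select(onp,-c(url,timedelta,shares)))
drop_idx=findCorrelation(mcor2, cutoff=0.9)
colnames(mcor2)[drop_idx]  # candidate predictors to remove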
The main options are:
Subset selection
Shrinkage (regularization)
Dimension reduction
The first method involves identifying a subset of the p predictors that are supposed to be associated with the response, and then fitting a model using least squares on the reduced set of predictors. There are two kinds of subset selection, as sketched below:
Best subset selection
Stepwise selection (forward or backward)
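For reference, both kinds can be run with regsubsets() from the leaps package. The sketch below (our illustration, with an arbitrary cap of five predictors) uses the forward-stepwise variant, since an exhaustive search over 58 predictors would require fitting on the order of 2^58 models:
library(leaps)  # for regsubsets()
# Forward stepwise selection up to an illustrative nvmax of 5 predictors;
# really.big=TRUE acknowledges the large predictor count
subs=regsubsets(shares~.-url-timedelta, data=onp,
                nvmax=5, method="forward", really.big=TRUE)
summary(subs)$which[5,]  # predictors entering the best 5-variable model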
Due to the characteristics of the data set, both types of subset selection seem unsuitable for our purposes. Therefore, we are going to estimate the full model for now and supplement it with some diagnostics. Shrinkage and dimension reduction methods will be covered in future work.
Since the “number of shares” is a non-negative integer variable, it can be useful to apply a Poisson model for count data. Additionally, a model for the natural logarithm of the shares is also estimated.
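The chunk that produced the table below is not shown; here is a minimal sketch, assuming the models were fit with lm()/glm() and tabulated with stargazer (lshares.all reappears in the diagnostics later on; the other object names are our own):
library(stargazer)  # regression table formatting
# Linear, log-linear and Poisson specifications on all 58 predictors
shares.all  = lm(shares ~ .-url-timedelta, data=onp)
lshares.all = lm(log(shares) ~ .-url-timedelta, data=onp)
pshares.all = glm(shares ~ .-url-timedelta, data=onp, family=poisson)
stargazer(shares.all, lshares.all, pshares.all, type="text")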
##
## ==================================================================================================
## Dependent variable:
## --------------------------------------------------------------------
## shares log(shares) shares
## normal normal Poisson
## (1) (2) (3)
## --------------------------------------------------------------------------------------------------
## n_tokens_title 90.16*** (28.67) 0.01*** (0.002) 0.03*** (0.0000)
## n_tokens_content 0.54** (0.22) 0.0000** (0.0000) 0.0001*** (0.0000)
## n_unique_tokens 2,273.60** (944.50) -0.07 (0.07) 0.62*** (0.001)
## num_hrefs 27.42*** (6.65) 0.004*** (0.001) 0.01*** (0.0000)
## num_self_hrefs -57.71*** (17.82) -0.01*** (0.001) -0.02*** (0.0000)
## num_imgs 14.79* (8.47) 0.003*** (0.001) 0.004*** (0.0000)
## num_videos 7.13 (15.68) 0.002* (0.001) 0.002*** (0.0000)
## average_token_length -530.61** (229.71) -0.08*** (0.02) -0.15*** (0.0003)
## num_keywords 50.34 (37.13) 0.01*** (0.003) 0.03*** (0.0001)
## data_channel_is_lifestyle -1,087.29*** (393.04) -0.11*** (0.03) -0.24*** (0.001)
## data_channel_is_entertainment -1,195.37*** (254.80) -0.19*** (0.02) -0.30*** (0.0003)
## data_channel_is_bus -827.98** (381.94) -0.17*** (0.03) -0.27*** (0.001)
## data_channel_is_socmed -619.04* (372.04) 0.16*** (0.03) -0.09*** (0.001)
## data_channel_is_tech -575.07 (370.72) 0.10*** (0.03) -0.14*** (0.001)
## data_channel_is_world -516.20 (374.95) -0.05* (0.03) -0.14*** (0.001)
## kw_min_min 2.20 (1.62) 0.001*** (0.0001) 0.001*** (0.0000)
## kw_max_min 0.09* (0.05) 0.0000*** (0.0000) 0.0000*** (0.0000)
## kw_avg_min -0.35 (0.31) -0.0001*** (0.0000) -0.0001*** (0.0000)
## kw_min_max -0.002* (0.001) -0.0000*** (0.0000) -0.0000*** (0.00)
## kw_max_max -0.001 (0.001) 0.0000 (0.0000) -0.0000*** (0.00)
## kw_avg_max -0.001 (0.001) -0.0000*** (0.0000) 0.0000*** (0.00)
## kw_min_avg -0.36*** (0.08) -0.0001*** (0.0000) -0.0001*** (0.0000)
## kw_max_avg -0.20*** (0.03) -0.0000*** (0.0000) -0.0000*** (0.0000)
## kw_avg_avg 1.66*** (0.14) 0.0003*** (0.0000) 0.0003*** (0.0000)
## self_reference_min_shares 0.03*** (0.01) 0.0000 (0.0000) 0.0000*** (0.00)
## self_reference_max_shares 0.01 (0.004) -0.0000 (0.0000) 0.0000*** (0.00)
## self_reference_avg_sharess -0.01 (0.01) 0.0000* (0.0000) -0.0000*** (0.00)
## weekday_is_monday 257.97 (263.10) -0.22*** (0.02) 0.06*** (0.0004)
## weekday_is_tuesday -280.92 (259.21) -0.29*** (0.02) -0.10*** (0.0004)
## weekday_is_wednesday -115.30 (259.14) -0.28*** (0.02) -0.05*** (0.0004)
## weekday_is_thursday -290.95 (259.70) -0.28*** (0.02) -0.10*** (0.0004)
## weekday_is_friday -255.34 (268.97) -0.21*** (0.02) -0.09*** (0.0004)
## weekday_is_saturday 380.96 (320.53) 0.003 (0.02) 0.08*** (0.0004)
## weekday_is_sunday
## is_weekend
## LDA_00 1,588,804.00** (662,365.20) -51.94 (50.12) 433.00*** (0.94)
## LDA_01 1,587,952.00** (662,335.00) -52.31 (50.11) 432.70*** (0.94)
## LDA_02 1,587,545.00** (662,362.10) -52.39 (50.12) 432.43*** (0.94)
## LDA_03 1,588,393.00** (662,340.60) -52.28 (50.11) 432.88*** (0.94)
## LDA_04 1,588,366.00** (662,365.80) -52.16 (50.12) 432.83*** (0.94)
## global_subjectivity 2,467.88*** (849.76) 0.41*** (0.06) 0.76*** (0.001)
## global_sentiment_polarity 709.30 (1,667.26) -0.11 (0.13) 0.31*** (0.002)
## global_rate_positive_words -13,386.35* (7,164.71) -1.14** (0.54) -4.14*** (0.01)
## global_rate_negative_words 2,431.77 (13,671.66) 0.61 (1.03) 1.15*** (0.02)
## rate_positive_words 251.67 (1,379.66) 0.29*** (0.10) 0.03*** (0.002)
## rate_negative_words 129.72 (1,560.75) 0.13 (0.12) -0.04*** (0.002)
## avg_positive_polarity -1,653.97 (1,365.70) -0.001 (0.10) -0.53*** (0.002)
## min_positive_polarity -1,811.61 (1,134.36) -0.25*** (0.09) -0.49*** (0.002)
## max_positive_polarity 309.29 (429.40) -0.03 (0.03) 0.13*** (0.001)
## avg_negative_polarity -1,733.63 (1,258.54) -0.15 (0.10) -0.47*** (0.002)
## min_negative_polarity 177.60 (456.53) 0.01 (0.03) 0.02*** (0.001)
## max_negative_polarity -273.59 (1,042.89) 0.07 (0.08) -0.02*** (0.001)
## title_subjectivity -98.40 (274.16) 0.06*** (0.02) -0.03*** (0.0004)
## title_sentiment_polarity 210.89 (250.43) 0.08*** (0.02) 0.05*** (0.0003)
## abs_title_subjectivity 636.89* (363.99) 0.14*** (0.03) 0.17*** (0.001)
## abs_title_sentiment_polarity 617.41 (395.69) 0.03 (0.03) 0.17*** (0.001)
## Constant -1,588,931.00** (662,368.80) 59.04 (50.12) -425.70*** (0.94)
## --------------------------------------------------------------------------------------------------
## Observations 39,644 39,644 39,644
## Log Likelihood -426,900.90 -50,709.97 -115,919,611.00
## Akaike Inf. Crit. 853,911.80 101,529.90 231,839,332.00
## ==================================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
As shown above, the set of relevant variables changes in the semi-elastic (log-lin) model compared with the lin-lin version. However, the log model shows better results in terms of normality of the residuals, which is mostly explained by the existence of outliers.
par(mfrow=c(2,2))  # arrange the four diagnostic plots in a 2x2 grid
plot(lshares.all)  # standard lm diagnostics for the log-shares model
The first plot (Residuals vs Fitted) shows whether the residuals have non-linear patterns. Since the residuals show a fairly uniform spread around a horizontal line, we conclude that there are no strong non-linear relationships in the model.
In the Normal Q-Q plot, a perfectly straight line is associated with perfectly normal residuals. In our case the observations tend to disperse at the edges, which is associated with two “skinny” tails. However, this is a minor problem; moreover, since we are dealing with a large data set, the central limit theorem implies that the OLS estimators are approximately normally distributed even if the errors are not exactly normal.
The Scale-Location plot lets us check the assumption of equal variance (homoscedasticity). Here we observe that the existence of extreme observations significantly affects the slope of the line (a horizontal line is associated with homoscedasticity). Here we can see those values:
library(car)              # for outlierTest()
outlierTest(lshares.all)  # Bonferroni test on the most extreme studentized residuals
## rstudent unadjusted p-value Bonferonni p
## 17267 -9.323591 1.1247e-20 4.4586e-16
## 4710 -7.300535 2.8662e-13 1.1363e-08
## 9366 6.893020 5.4620e-12 2.1653e-07
## 38634 -6.566551 5.1494e-11 2.0414e-06
## 5371 6.481593 9.0759e-11 3.5980e-06
## 23238 6.364513 1.9591e-10 7.7665e-06
## 3146 6.311239 2.7681e-10 1.0974e-05
## 16010 6.271347 3.5794e-10 1.4190e-05
## 34477 -6.202767 5.5479e-10 2.1994e-05
## 4507 6.036392 1.5760e-09 6.2477e-05
Finally, the Residuals vs Leverage plot helps us to find influential outliers in the linear regression analysis. Since no observations fall outside the Cook’s distance lines, none of them unduly affects the regression results.
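As a numerical complement to the visual homoscedasticity check, a Breusch-Pagan test could be run; a minimal sketch using bptest() from the lmtest package (our addition, not part of the original diagnostics):
library(lmtest)  # for bptest()
# Breusch-Pagan test: H0 is homoscedastic errors; a small p-value would
# corroborate the non-constant variance seen in the Scale-Location plot
bptest(lshares.all)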
We have conducted a multiple regression analysis of the number of shares based on several metrics. Since we are dealing with multiple regressors that are in some cases linearly related, it is advisable to apply a method of variable selection. However, the best subset and step-wise options are unsuitable here since they are computationally intensive. Lastly, the authors of the data set dealt with its complexity by applying several algorithms, with random forest achieving the best results.