The linear regression model studies the relationship between a dependent variable (\(y\)) and a set of one or more independent variables (\(x_1,\dots,x_k\)). Mathematically, the population regression function (PRF) can be expressed as: \[y=f(x_1,x_2,\dots,x_k)+u=\beta_1 x_1+\beta_2 x_2+\dots+\beta_k x_k+u\] The idea behind the regression is to estimate the population parameters \(\boldsymbol{\beta}\) from a sample. Unlike its population counterpart, the sample regression function (SRF) can be expressed in matrix notation as: \[\boldsymbol{\hat{y}=X\hat{\beta}}\] To obtain the coefficients we minimize the sum of the squared residuals, applying the first-order conditions (FOC) to: \[S=\sum_i \hat{u}^2_i=\boldsymbol{(y-X\hat{\beta})'(y-X\hat{\beta})}\] The moment conditions result in the normal equations: \[\boldsymbol{X'X\hat{\beta}=X'y}\to \boldsymbol{\hat{\beta}_{OLS}=(X'X)^{-1}X'y}\]
Under assumptions 1-4, the OLS estimator is unbiased
Under assumptions 1-6, the OLS estimator is BLUE (best linear unbiased estimator)
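Since the closed form above is all the OLS estimator needs, it can be checked numerically. The following sketch (on simulated data; all object names are illustrative and not part of the analysis below) computes \(\boldsymbol{(X'X)^{-1}X'y}\) directly and compares it with R's lm():
set.seed(1)
n=100
x1=rnorm(n); x2=rnorm(n)
y=1+2*x1-3*x2+rnorm(n)              # simulated PRF with known betas
X=cbind(1,x1,x2)                    # design matrix with an intercept column
beta_hat=solve(t(X)%*%X, t(X)%*%y)  # (X'X)^{-1} X'y via a linear solve
cbind(beta_hat, coef(lm(y~x1+x2)))  # the two columns should match
Both columns recover the population parameters (1, 2, -3) up to sampling error.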
This data set summarizes a heterogeneous set of features about articles published by Mashable over a period of two years. The goal is to predict the number of shares in social networks (popularity).
K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal.
# Download, unzip and read the data set
download.file("http://archive.ics.uci.edu/ml/machine-learning-databases/00332/OnlineNewsPopularity.zip",
              destfile="onp.zip", mode="wb")
files=unzip("onp.zip")  # extract the archive and keep the file paths
onp=read.csv(files[2])  # the second extracted file is the CSV
The data set has 39,644 observations and 61 variables, 58 of which are going to be used as predictors for the number of shares.
library(dplyr)   # for tbl_df() and glimpse()
onp=tbl_df(onp)  # wrap the data frame for nicer printing
glimpse(onp)
## Observations: 39,644
## Variables: 61
## $ url (fctr) http://mashable.com/2013/01/07/...
## $ timedelta (dbl) 731, 731, 731, 731, 731, 731, 73...
## $ n_tokens_title (dbl) 12, 9, 9, 9, 13, 10, 8, 12, 11, ...
## $ n_tokens_content (dbl) 219, 255, 211, 531, 1072, 370, 9...
## $ n_unique_tokens (dbl) 0.6635945, 0.6047431, 0.5751295,...
## $ n_non_stop_words (dbl) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ n_non_stop_unique_tokens (dbl) 0.8153846, 0.7919463, 0.6638655,...
## $ num_hrefs (dbl) 4, 3, 3, 9, 19, 2, 21, 20, 2, 4,...
## $ num_self_hrefs (dbl) 2, 1, 1, 0, 19, 2, 20, 20, 0, 1,...
## $ num_imgs (dbl) 1, 1, 1, 1, 20, 0, 20, 20, 0, 1,...
## $ num_videos (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ average_token_length (dbl) 4.680365, 4.913725, 4.393365, 4....
## $ num_keywords (dbl) 5, 4, 6, 7, 7, 9, 10, 9, 7, 5, 8...
## $ data_channel_is_lifestyle (dbl) 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ data_channel_is_entertainment (dbl) 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ data_channel_is_bus (dbl) 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ data_channel_is_socmed (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ data_channel_is_tech (dbl) 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0,...
## $ data_channel_is_world (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,...
## $ kw_min_min (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ kw_max_min (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ kw_avg_min (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ kw_min_max (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ kw_max_max (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ kw_avg_max (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ kw_min_avg (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ kw_max_avg (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ kw_avg_avg (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ self_reference_min_shares (dbl) 496, 0, 918, 0, 545, 8500, 545, ...
## $ self_reference_max_shares (dbl) 496, 0, 918, 0, 16000, 8500, 160...
## $ self_reference_avg_sharess (dbl) 496.000, 0.000, 918.000, 0.000, ...
## $ weekday_is_monday (dbl) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ weekday_is_tuesday (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ weekday_is_wednesday (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ weekday_is_thursday (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ weekday_is_friday (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ weekday_is_saturday (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ weekday_is_sunday (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ is_weekend (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ LDA_00 (dbl) 0.50033120, 0.79975569, 0.217792...
## $ LDA_01 (dbl) 0.37827893, 0.05004668, 0.033334...
## $ LDA_02 (dbl) 0.04000468, 0.05009625, 0.033351...
## $ LDA_03 (dbl) 0.04126265, 0.05010067, 0.033333...
## $ LDA_04 (dbl) 0.04012254, 0.05000071, 0.682188...
## $ global_subjectivity (dbl) 0.5216171, 0.3412458, 0.7022222,...
## $ global_sentiment_polarity (dbl) 0.09256198, 0.14894781, 0.323333...
## $ global_rate_positive_words (dbl) 0.04566210, 0.04313725, 0.056872...
## $ global_rate_negative_words (dbl) 0.013698630, 0.015686275, 0.0094...
## $ rate_positive_words (dbl) 0.7692308, 0.7333333, 0.8571429,...
## $ rate_negative_words (dbl) 0.2307692, 0.2666667, 0.1428571,...
## $ avg_positive_polarity (dbl) 0.3786364, 0.2869146, 0.4958333,...
## $ min_positive_polarity (dbl) 0.10000000, 0.03333333, 0.100000...
## $ max_positive_polarity (dbl) 0.7000000, 0.7000000, 1.0000000,...
## $ avg_negative_polarity (dbl) -0.3500000, -0.1187500, -0.46666...
## $ min_negative_polarity (dbl) -0.6000000, -0.1250000, -0.80000...
## $ max_negative_polarity (dbl) -0.2000000, -0.1000000, -0.13333...
## $ title_subjectivity (dbl) 0.5000000, 0.0000000, 0.0000000,...
## $ title_sentiment_polarity (dbl) -0.1875000, 0.0000000, 0.0000000...
## $ abs_title_subjectivity (dbl) 0.00000000, 0.50000000, 0.500000...
## $ abs_title_sentiment_polarity (dbl) 0.1875000, 0.0000000, 0.0000000,...
## $ shares (int) 593, 711, 1500, 1200, 505, 855, ...
It is useful to check for correlations among the predictors.
library(d3heatmap)  # interactive correlation heatmap
mcor=cor(select(onp,-c(url,timedelta,shares)))  # correlations among the 58 predictors
diag(mcor)=NA                                   # blank out the trivial self-correlations
d3heatmap(mcor, colors=c("red","white","green"),
          scale="col", dendrogram="both",
          xaxis_font_size="7pt", yaxis_font_size="7pt")
As we can see, the heatmap shows green and red segments, revealing positive and negative linear associations respectively. Linear regression can still be used for prediction when a certain degree of collinearity exists, as long as it is not perfect (Gauss-Markov assumption #3). However, collinearity results in a loss of our ability to interpret the coefficients. In order to mitigate this problem we can pre-process the data to remove pairwise correlated predictors.
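One way to carry out that pre-processing, sketched below, is findCorrelation() from the caret package (our choice for illustration; any pairwise-correlation filter would do). It flags predictors whose absolute pairwise correlation exceeds a cutoff:
library(caret)  # for findCorrelation()
# Recompute the correlation matrix without the NA diagonal;
# the 0.9 cutoff is illustrative, not a tuned value
mcor2=cor(select(onp,-c(url,timedelta,shares)))
drop_idx=findCorrelation(mcor2, cutoff=0.9)
colnames(mcor2)[drop_idx]  # candidate predictors to remove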
The main options are:
Subset selection
Shrinkage (regularization)
Dimension reduction
The first method involves identifying a subset of the p predictors that are supposed to be associated with the response, and then fitting a model using least squares on the reduced set of predictors. There are two kinds of subset selection, as sketched below:
Best subset selection
Stepwise selection (forward or backward)
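For reference, both kinds can be run with regsubsets() from the leaps package. The sketch below (our illustration, with an arbitrary cap of five predictors) uses the forward-stepwise variant, since an exhaustive search over 58 predictors would require fitting on the order of 2^58 models:
library(leaps)  # for regsubsets()
# Forward stepwise selection up to an illustrative nvmax of 5 predictors;
# really.big=TRUE acknowledges the large predictor count
subs=regsubsets(shares~.-url-timedelta, data=onp,
                nvmax=5, method="forward", really.big=TRUE)
summary(subs)$which[5,]  # predictors entering the best 5-variable model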
Due to the characteristics of the data set, both types of subset selection seem unsuitable for our purposes. Therefore, we are going to estimate the full model for now and supplement it with some diagnostics. Shrinkage and dimension reduction methods will be covered in future work.
Since the “number of shares” is a non-negative integer variable, it can be useful to apply a Poisson model for count data. Additionally, a model for the natural logarithm of the shares is also estimated.
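The chunk that produced the table below is not shown; here is a minimal sketch, assuming the models were fit with lm()/glm() and tabulated with stargazer (lshares.all reappears in the diagnostics later on; the other object names are our own):
library(stargazer)  # regression table formatting
# Linear, log-linear and Poisson specifications on all 58 predictors
shares.all  = lm(shares ~ .-url-timedelta, data=onp)
lshares.all = lm(log(shares) ~ .-url-timedelta, data=onp)
pshares.all = glm(shares ~ .-url-timedelta, data=onp, family=poisson)
stargazer(shares.all, lshares.all, pshares.all, type="text")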
##
## ==================================================================================================
## Dependent variable:
## --------------------------------------------------------------------
## shares log(shares) shares
## normal normal Poisson
## (1) (2) (3)
## --------------------------------------------------------------------------------------------------
## n_tokens_title 90.16*** (28.67) 0.01*** (0.002) 0.03*** (0.0000)
## n_tokens_content 0.54** (0.22) 0.0000** (0.0000) 0.0001*** (0.0000)
## n_unique_tokens 2,273.60** (944.50) -0.07 (0.07) 0.62*** (0.001)
## num_hrefs 27.42*** (6.65) 0.004*** (0.001) 0.01*** (0.0000)
## num_self_hrefs -57.71*** (17.82) -0.01*** (0.001) -0.02*** (0.0000)
## num_imgs 14.79* (8.47) 0.003*** (0.001) 0.004*** (0.0000)
## num_videos 7.13 (15.68) 0.002* (0.001) 0.002*** (0.0000)
## average_token_length -530.61** (229.71) -0.08*** (0.02) -0.15*** (0.0003)
## num_keywords 50.34 (37.13) 0.01*** (0.003) 0.03*** (0.0001)
## data_channel_is_lifestyle -1,087.29*** (393.04) -0.11*** (0.03) -0.24*** (0.001)
## data_channel_is_entertainment -1,195.37*** (254.80) -0.19*** (0.02) -0.30*** (0.0003)
## data_channel_is_bus -827.98** (381.94) -0.17*** (0.03) -0.27*** (0.001)
## data_channel_is_socmed -619.04* (372.04) 0.16*** (0.03) -0.09*** (0.001)
## data_channel_is_tech -575.07 (370.72) 0.10*** (0.03) -0.14*** (0.001)
## data_channel_is_world -516.20 (374.95) -0.05* (0.03) -0.14*** (0.001)
## kw_min_min 2.20 (1.62) 0.001*** (0.0001) 0.001*** (0.0000)
## kw_max_min 0.09* (0.05) 0.0000*** (0.0000) 0.0000*** (0.0000)
## kw_avg_min -0.35 (0.31) -0.0001*** (0.0000) -0.0001*** (0.0000)
## kw_min_max -0.002* (0.001) -0.0000*** (0.0000) -0.0000*** (0.00)
## kw_max_max -0.001 (0.001) 0.0000 (0.0000) -0.0000*** (0.00)
## kw_avg_max -0.001 (0.001) -0.0000*** (0.0000) 0.0000*** (0.00)
## kw_min_avg -0.36*** (0.08) -0.0001*** (0.0000) -0.0001*** (0.0000)
## kw_max_avg -0.20*** (0.03) -0.0000*** (0.0000) -0.0000*** (0.0000)
## kw_avg_avg 1.66*** (0.14) 0.0003*** (0.0000) 0.0003*** (0.0000)
## self_reference_min_shares 0.03*** (0.01) 0.0000 (0.0000) 0.0000*** (0.00)
## self_reference_max_shares 0.01 (0.004) -0.0000 (0.0000) 0.0000*** (0.00)
## self_reference_avg_sharess -0.01 (0.01) 0.0000* (0.0000) -0.0000*** (0.00)
## weekday_is_monday 257.97 (263.10) -0.22*** (0.02) 0.06*** (0.0004)
## weekday_is_tuesday -280.92 (259.21) -0.29*** (0.02) -0.10*** (0.0004)
## weekday_is_wednesday -115.30 (259.14) -0.28*** (0.02) -0.05*** (0.0004)
## weekday_is_thursday -290.95 (259.70) -0.28*** (0.02) -0.10*** (0.0004)
## weekday_is_friday -255.34 (268.97) -0.21*** (0.02) -0.09*** (0.0004)
## weekday_is_saturday 380.96 (320.53) 0.003 (0.02) 0.08*** (0.0004)
## weekday_is_sunday
## is_weekend
## LDA_00 1,588,804.00** (662,365.20) -51.94 (50.12) 433.00*** (0.94)
## LDA_01 1,587,952.00** (662,335.00) -52.31 (50.11) 432.70*** (0.94)
## LDA_02 1,587,545.00** (662,362.10) -52.39 (50.12) 432.43*** (0.94)
## LDA_03 1,588,393.00** (662,340.60) -52.28 (50.11) 432.88*** (0.94)
## LDA_04 1,588,366.00** (662,365.80) -52.16 (50.12) 432.83*** (0.94)
## global_subjectivity 2,467.88*** (849.76) 0.41*** (0.06) 0.76*** (0.001)
## global_sentiment_polarity 709.30 (1,667.26) -0.11 (0.13) 0.31*** (0.002)
## global_rate_positive_words -13,386.35* (7,164.71) -1.14** (0.54) -4.14*** (0.01)
## global_rate_negative_words 2,431.77 (13,671.66) 0.61 (1.03) 1.15*** (0.02)
## rate_positive_words 251.67 (1,379.66) 0.29*** (0.10) 0.03*** (0.002)
## rate_negative_words 129.72 (1,560.75) 0.13 (0.12) -0.04*** (0.002)
## avg_positive_polarity -1,653.97 (1,365.70) -0.001 (0.10) -0.53*** (0.002)
## min_positive_polarity -1,811.61 (1,134.36) -0.25*** (0.09) -0.49*** (0.002)
## max_positive_polarity 309.29 (429.40) -0.03 (0.03) 0.13*** (0.001)
## avg_negative_polarity -1,733.63 (1,258.54) -0.15 (0.10) -0.47*** (0.002)
## min_negative_polarity 177.60 (456.53) 0.01 (0.03) 0.02*** (0.001)
## max_negative_polarity -273.59 (1,042.89) 0.07 (0.08) -0.02*** (0.001)
## title_subjectivity -98.40 (274.16) 0.06*** (0.02) -0.03*** (0.0004)
## title_sentiment_polarity 210.89 (250.43) 0.08*** (0.02) 0.05*** (0.0003)
## abs_title_subjectivity 636.89* (363.99) 0.14*** (0.03) 0.17*** (0.001)
## abs_title_sentiment_polarity 617.41 (395.69) 0.03 (0.03) 0.17*** (0.001)
## Constant -1,588,931.00** (662,368.80) 59.04 (50.12) -425.70*** (0.94)
## --------------------------------------------------------------------------------------------------
## Observations 39,644 39,644 39,644
## Log Likelihood -426,900.90 -50,709.97 -115,919,611.00
## Akaike Inf. Crit. 853,911.80 101,529.90 231,839,332.00
## ==================================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
As shown above, the set of relevant variables changes in the semi-elastic (log-lin) model compared with the lin-lin version. However, the log model shows better results in terms of normality of the residuals, which is mostly explained by the existence of outliers.
par(mfrow=c(2,2))  # arrange the four diagnostic plots in a 2x2 grid
plot(lshares.all)  # standard lm diagnostics for the log-shares model
The first plot (Residuals vs Fitted) shows whether the residuals have non-linear patterns. Since the residuals show a fairly uniform spread around a horizontal line, we conclude that there are no strong non-linear relationships in the model.
In the Normal Q-Q plot, a perfectly straight line is associated with perfectly normal residuals. In our case the observations tend to disperse at the edges, which is associated with two “skinny” tails. However, this is a minor problem; moreover, since we are dealing with a large data set, the central limit theorem implies that the OLS estimators are approximately normally distributed even if the errors are not exactly normal.
The Scale-Location plot lets us check the assumption of equal variance (homoscedasticity). Here we observe that the existence of extreme observations significantly affects the slope of the line (a horizontal line is associated with homoscedasticity). Here we can see those values:
library(car)              # for outlierTest()
outlierTest(lshares.all)  # Bonferroni test on the most extreme studentized residuals
## rstudent unadjusted p-value Bonferonni p
## 17267 -9.323591 1.1247e-20 4.4586e-16
## 4710 -7.300535 2.8662e-13 1.1363e-08
## 9366 6.893020 5.4620e-12 2.1653e-07
## 38634 -6.566551 5.1494e-11 2.0414e-06
## 5371 6.481593 9.0759e-11 3.5980e-06
## 23238 6.364513 1.9591e-10 7.7665e-06
## 3146 6.311239 2.7681e-10 1.0974e-05
## 16010 6.271347 3.5794e-10 1.4190e-05
## 34477 -6.202767 5.5479e-10 2.1994e-05
## 4507 6.036392 1.5760e-09 6.2477e-05
Finally, the Residuals vs Leverage plot helps us to find influential outliers in the linear regression analysis. Since no observations fall outside the Cook’s distance lines, none of them unduly affects the regression results.
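As a numerical complement to the visual homoscedasticity check, a Breusch-Pagan test could be run; a minimal sketch using bptest() from the lmtest package (our addition, not part of the original diagnostics):
library(lmtest)  # for bptest()
# Breusch-Pagan test: H0 is homoscedastic errors; a small p-value would
# corroborate the non-constant variance seen in the Scale-Location plot
bptest(lshares.all)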
We have conducted a multiple regression analysis of the number of shares based on several metrics. Since we are dealing with multiple regressors that are in some cases linearly related, it is advisable to apply a method of variable selection. However, the best subset and step-wise options are unsuitable here since they are computationally intensive. Lastly, the authors of the data set dealt with its complexity by applying several algorithms, with random forest achieving the best results.