Twitter User Analysis

According to Alexa.com, Twitter.com is the 10th most popular site in the world. Twitter is a social network that allows users to share information in strings of 140 or fewer characters, called status updates or tweets. Twitter also allows a user A to follow another user B, which lets user A easily view all of user B's status updates and makes user A a follower of user B. The number of followers a user has can be seen as a status symbol, or as an indicator of the user's social media influence. This study attempts to predict the number of followers from various characteristics of a Twitter user. More precisely, it aims to predict the number of followers for the top 1000 Twitter accounts associated with the search term “data”.

About the data

Twitter has an API (Application Programming Interface) that provides access to information about the top 1000 users for any search term. Unfortunately, Twitter does not specify how these top users are determined, but they can likely be regarded as the most influential users on Twitter for a given search term. On October 10, 2013, the Twitter API was used to pull information about the top 1000 users associated with the term “data”. The final data is formatted as a CSV (Comma-Separated Values) file, with each row representing a single user and the columns as follows:

  1. handle - twitter username | string
  2. name - full name of the twitter user | string
  3. age - number of days the user has existed on twitter | number
  4. num_of_tweets - number of tweets this user has created (includes retweets) | number
  5. has_profile - 1 if the user has created a profile description, 0 otherwise | boolean
  6. has_pic - 1 if the user has set up a profile pic, 0 otherwise | boolean
  7. num_following - number of other twitter users this user is following | number
  8. num_of_favorites - number of tweets the user has favorited | number
  9. num_of_lists - number of public lists this user has been added to | number
  10. num_of_followers - number of other users following this user | number

Training and Validation Data

The data file was then randomly split into two datasets: one for training the model (60% of the users) and one for validating it (the remaining 40%).
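
The code that produced the split is not shown in this report. Here is a minimal sketch of how a 60/40 random split could be generated in R, assuming the combined CSV file name used later in the analysis:

full_data = read.csv("twitter_user_data_data.csv")
set.seed(34567)  # make the split reproducible
# draw 60% of the row indices at random for training
train_rows = sample(nrow(full_data), size = round(0.6 * nrow(full_data)))
training_data = full_data[train_rows, ]
validation_data = full_data[-train_rows, ]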

Exploratory Analysis

The only notable finding from the initial exploration is that the has_pic column contains just a single value that differs from the rest: only one user in the training set lacks a profile picture (the mean of has_pic is 0.998). With essentially no variation, has_pic will not be included in the analysis.
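
A one-line check (not part of the original code) confirms this:

table(training_data$has_pic)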

library(stats)
# set a seed so any randomized steps below are reproducible
set.seed(34567)
# read in the pre-split data files
training_data = read.csv("twitter_user_data_data_training.csv")
validation_data = read.csv("twitter_user_data_data_test.csv")
full_data = read.csv("twitter_user_data_data.csv")
summary(training_data)
##              handle                      name          age      
##  AnnalynKurtz   :  2   Big Data            :  5   Min.   :   5  
##  BigData_Journal:  2   Annalyn Kurtz       :  2   1st Qu.: 650  
##  DataWzrd       :  2   DATA                :  2   Median :1153  
##  kdnuggets      :  2   Economic Data       :  2   Mean   :1132  
##  neil_conway    :  2   Gregory Piatetsky   :  2   3rd Qu.:1582  
##  payscale       :  2   Hitachi Data Systems:  2   Max.   :2590  
##  (Other)        :588   (Other)             :585                 
##  num_of_tweets     has_profile       has_pic      num_following  
##  Min.   :     2   Min.   :0.000   Min.   :0.000   Min.   :    0  
##  1st Qu.:   291   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:  124  
##  Median :   865   Median :1.000   Median :1.000   Median :  426  
##  Mean   :  4343   Mean   :0.752   Mean   :0.998   Mean   :  906  
##  3rd Qu.:  2652   3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.: 1053  
##  Max.   :203191   Max.   :1.000   Max.   :1.000   Max.   :35422  
##                                                                  
##  num_of_favorites  num_of_lists  num_of_followers
##  Min.   :    0    Min.   :   0   Min.   :    24  
##  1st Qu.:    0    1st Qu.:  11   1st Qu.:   441  
##  Median :    4    Median :  34   Median :   977  
##  Mean   :  456    Mean   : 167   Mean   :  4455  
##  3rd Qu.:   42    3rd Qu.: 118   3rd Qu.:  2749  
##  Max.   :70773    Max.   :6412   Max.   :211726  
## 

pairs(training_data[3:10])

[Figure: scatterplot matrix (pairs plot) of columns 3 through 10 of the training data]

Outliers

Looking at the plot of num_following versus num_of_lists, there appear to be a few outliers: one user with a very high num_following and users with very high num_of_lists. Removing them shrinks the training set from 600 users to 597. The detailed analysis is not included here, but the models performed the same or better with the outliers removed.

plot(training_data$num_of_lists, training_data$num_following)

[Figure: num_following versus num_of_lists, before outlier removal]

training_data = training_data[training_data$num_following < 20000 & training_data$num_of_lists < 
    5000, ]
dim(training_data)  # new dimensions
## [1] 597  10
plot(training_data$num_of_lists, training_data$num_following)

[Figure: num_following versus num_of_lists, after outlier removal]

Analysis

First, a linear model with all the predictors was created. The full linear model identified the following predictors as significant: age, num_following, and num_of_lists. Backward stepwise regression was then performed, and the best-fitting model it selected contained the same three predictors.

The Box-Cox method was used to determine whether the response variable should be transformed. As can be seen in the Box-Cox plot, the maximum of the log-likelihood occurred at \( \lambda = 0.06 \). Based on that value, two separate transformations were tried: taking the natural log of the dependent variable, and raising the dependent variable to the power 0.06. Neither transformation yielded promising results, so the detailed analysis is not included in this report.

Next, all-subsets regression was performed using the leaps package in R. The leaps package performs an exhaustive search over all possible subsets of the variables to find the best-fitting models according to Mallows' \( C_p \) criterion. As can be seen in the output plot, four models had low \( C_p \) values compared to the rest. Not surprisingly, those four models contain different combinations of the three predictors already identified, including the model selected by the stepwise regression. For these reasons, the four models with the lowest \( C_p \) values were compared to determine the best model.

Here are the 4 candidate models being considered.

Model 1

\[ num\_of\_followers = \beta_0 + \beta_1*age + \beta_2*num\_following + \beta_3*num\_of\_lists \]

Model 2

\[ num\_of\_followers = \beta_0 + \beta_2*num\_following + \beta_3*num\_of\_lists \]

Model 3

\[ num\_of\_followers = \beta_0 + \beta_3*num\_of\_lists \]

Model 4

\[ num\_of\_followers = \beta_0 + \beta_1*age + \beta_3*num\_of\_lists \]

Best Model

First, look at the PRESS statistic. A PRESS statistic reasonably close to the SSE supports the validity of a linear regression model. As can be seen in the output below, all four candidate models have a PRESS statistic reasonably close to their SSE.
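
For reference, PRESS is the sum of squared leave-one-out prediction errors, where \( \hat{y}_{i(i)} \) is the fitted value for observation \( i \) from the model refit without observation \( i \):

\[ PRESS = \sum_{i=1}^{n} (y_i - \hat{y}_{i(i)})^2 \]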

Next, look at the Mallows' \( C_p \) value. A lower \( C_p \) value is better; in particular, the \( C_p \) value should be less than p (the number of predictors + 1 for the intercept), and a \( C_p \) equal to p indicates a model with no bias, so it is advantageous to find a \( C_p \) near p. Model 4 has the lowest overall \( C_p \) value, but its p is 3, making its \( C_p \) of 3.88 greater than p. Only Model 1 has a \( C_p \) less than or equal to p, with \( C_p = 4 \) and \( p = 4 \). Thus, Mallows' \( C_p \) favors Model 1.
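
The \( C_p \) values here follow the standard definition, where \( SSE_p \) is the error sum of squares of a candidate model with \( p \) coefficients, \( MSE_{full} \) is the mean squared error of the full model, and \( n \) is the number of observations:

\[ C_p = \frac{SSE_p}{MSE_{full}} - (n - 2p) \]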

Finally, look at the MSPR (mean squared prediction error) for the four models; a lower value indicates more predictive power. (In the code below, the MSPR is computed by the helper function msrp.) Model 1 has the lowest value of the four models, so the MSPR favors Model 1 as well.
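
The MSPR is the average squared prediction error over the \( n^* \) observations in the validation set:

\[ MSPR = \frac{1}{n^*} \sum_{i=1}^{n^*} (Y_i - \hat{Y}_i)^2 \]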

Overall, Model 1 appears to be the best predictive model for the Twitter data. Model 1 was then refit using all the available data, not just the training data. The final model for predicting the number of followers of a Twitter user in the top 1000 for the search term 'data' is:

\[ num\_of\_followers = 898 - 1.5*age + 0.8*num\_following + 28.1*num\_of\_lists \]

Here is how the final model can be interpreted. Given a brand-new account following 0 users and not appearing in any lists, a Twitter user in the top 1000 would be expected to have 898 followers. At first this seems ridiculous: why would a user have any followers with no activity and a brand-new account? Remember that this data covers Twitter users in the top 1000, so for a new account to appear there, the person or organization that created it is likely already influential outside of Twitter. Think of a celebrity creating a Twitter account: it will quickly start attracting followers in anticipation of future activity.

All other factors remaining the same, an increase in the age of the account by 1 day is associated with a decrease of 1.5 followers, so simply having a Twitter account for longer does not appear to increase followers. Likewise, when the other predictors are held constant, every additional user an account follows is associated with 0.8 more followers; put another way, following 10 more people corresponds to about 8 more followers.

With the rest of the predictors held constant, being included in one more list is associated with 28.1 more followers, making list membership the most influential predictor of followers. This makes sense: the data covers the top 1000 accounts, and being in more lists means being more influential.
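
To make the interpretation concrete, here is a quick sketch of using the fitted model for prediction. The account characteristics are hypothetical, and final_model refers to the object fit in the R code section at the end of this report:

# hypothetical account: one year old, following 100 users, on 50 lists
new_user = data.frame(age = 365, num_following = 100, num_of_lists = 50)
predict(final_model, newdata = new_user)
# roughly 898 - 1.5*365 + 0.8*100 + 28.1*50, i.e. about 1840 followers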

One interesting area of future research would be generalizing this model to other search terms. Does the same model still work well, or do different terms need different models? What effect do search terms based on trending topics have?

Conclusions

It is possible to predict the number of followers for a Twitter user in the top 1000 for the search term “data”. The age of the account, the number of Twitter users the account is following, and the number of lists including the account all appear to be correlated with the number of followers. For those who are familiar with Twitter, it is not surprising that the number of tweets does not appear to be correlated with the number of followers; tweeting more is not, by itself, helpful for gaining followers. Having a profile description does not appear to be connected with the number of followers either.

Perhaps surprisingly, num_of_favorites does not appear to be correlated with the number of followers either. The number of tweets a user has favorited could be read as a rough indicator of how actively the user engages with quality content, yet more or less favoriting activity does not appear to be associated with the total number of followers.

R code

basic_model = lm(num_of_followers ~ age + num_of_tweets + num_following + num_of_favorites + 
    num_of_lists + as.factor(has_profile), data = training_data)
summary(basic_model)
## 
## Call:
## lm(formula = num_of_followers ~ age + num_of_tweets + num_following + 
##     num_of_favorites + num_of_lists + as.factor(has_profile), 
##     data = training_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -40344   -941    -42    781 135434 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             572.9385   887.8686    0.65    0.519    
## age                      -1.6756     0.7021   -2.39    0.017 *  
## num_of_tweets            -0.0142     0.0285   -0.50    0.618    
## num_following             0.5167     0.2623    1.97    0.049 *  
## num_of_favorites          0.0218     0.1024    0.21    0.832    
## num_of_lists             27.6131     1.1464   24.09   <2e-16 ***
## as.factor(has_profile)1 955.6425   834.5350    1.15    0.253    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8370 on 590 degrees of freedom
## Multiple R-squared:  0.552,  Adjusted R-squared:  0.547 
## F-statistic:  121 on 6 and 590 DF,  p-value: <2e-16

# Stepwise Regression
library(MASS)
step <- stepAIC(basic_model, direction = "backward")  # forward, backward, or both
## Start:  AIC=10792
## num_of_followers ~ age + num_of_tweets + num_following + num_of_favorites + 
##     num_of_lists + as.factor(has_profile)
## 
##                          Df Sum of Sq      RSS   AIC
## - num_of_favorites        1  3.17e+06 4.13e+10 10790
## - num_of_tweets           1  1.74e+07 4.14e+10 10790
## - as.factor(has_profile)  1  9.19e+07 4.14e+10 10791
## <none>                                4.13e+10 10792
## - num_following           1  2.72e+08 4.16e+10 10794
## - age                     1  3.99e+08 4.17e+10 10795
## - num_of_lists            1  4.06e+10 8.20e+10 11198
## 
## Step:  AIC=10790
## num_of_followers ~ age + num_of_tweets + num_following + num_of_lists + 
##     as.factor(has_profile)
## 
##                          Df Sum of Sq      RSS   AIC
## - num_of_tweets           1  1.55e+07 4.14e+10 10788
## - as.factor(has_profile)  1  9.49e+07 4.14e+10 10789
## <none>                                4.13e+10 10790
## - num_following           1  2.70e+08 4.16e+10 10792
## - age                     1  4.05e+08 4.17e+10 10794
## - num_of_lists            1  4.07e+10 8.20e+10 11197
## 
## Step:  AIC=10788
## num_of_followers ~ age + num_following + num_of_lists + as.factor(has_profile)
## 
##                          Df Sum of Sq      RSS   AIC
## - as.factor(has_profile)  1  9.71e+07 4.14e+10 10787
## <none>                                4.14e+10 10788
## - num_following           1  2.55e+08 4.16e+10 10790
## - age                     1  4.19e+08 4.18e+10 10792
## - num_of_lists            1  4.32e+10 8.45e+10 11213
## 
## Step:  AIC=10787
## num_of_followers ~ age + num_following + num_of_lists
## 
##                 Df Sum of Sq      RSS   AIC
## <none>                       4.14e+10 10787
## - num_following  1  2.90e+08 4.17e+10 10789
## - age            1  3.43e+08 4.18e+10 10790
## - num_of_lists   1  4.31e+10 8.46e+10 11211
step$anova  # display results 
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## num_of_followers ~ age + num_of_tweets + num_following + num_of_favorites + 
##     num_of_lists + as.factor(has_profile)
## 
## Final Model:
## num_of_followers ~ age + num_following + num_of_lists
## 
## 
##                       Step Df Deviance Resid. Df Resid. Dev   AIC
## 1                                            590  4.133e+10 10792
## 2       - num_of_favorites  1  3172372       591  4.134e+10 10790
## 3          - num_of_tweets  1 15470094       592  4.135e+10 10788
## 4 - as.factor(has_profile)  1 97119266       593  4.145e+10 10787

All-subsets Regression

# use leaps to compute Cp values for all predictor subsets
library(leaps)
# restrict to the candidate predictors: age, num_following, and num_of_lists
# (data frame columns 3, 7, and 9)
x = training_data[, c(3, 7, 9)]
y = training_data[, 10]  # num_of_followers
models = leaps(x, y)
models
## $which
##       1     2     3
## 1 FALSE FALSE  TRUE
## 1  TRUE FALSE FALSE
## 1 FALSE  TRUE FALSE
## 2  TRUE FALSE  TRUE
## 2 FALSE  TRUE  TRUE
## 2  TRUE  TRUE FALSE
## 3  TRUE  TRUE  TRUE
## 
## $label
## [1] "(Intercept)" "1"           "2"           "3"          
## 
## $size
## [1] 2 2 2 3 3 3 4
## 
## $Cp
## [1]   5.744 643.377 707.995   3.883   7.133 644.087   4.000

plot(models$size, models$Cp, log = "y", xlab = "# of predictors", ylab = expression(C[p]), 
    main = "Cp values by Number of Predictors", col = "red", cex = 1.5)

[Figure: Cp values by number of predictors]


minimum <- models$Cp == min(models$Cp)  # flag the model with the smallest Cp
best.model <- models$which[minimum, ]  # predictors included in that model

x_val = validation_data[, c(3, 7, 9)]  # same candidate predictors
y_val = validation_data[, 10]
models_val = leaps(x_val, y_val)
models_val
## $which
##       1     2     3
## 1 FALSE FALSE  TRUE
## 1  TRUE FALSE FALSE
## 1 FALSE  TRUE FALSE
## 2  TRUE FALSE  TRUE
## 2 FALSE  TRUE  TRUE
## 2  TRUE  TRUE FALSE
## 3  TRUE  TRUE  TRUE
## 
## $label
## [1] "(Intercept)" "1"           "2"           "3"          
## 
## $size
## [1] 2 2 2 3 3 3 4
## 
## $Cp
## [1]   1.825 532.267 544.891   2.663   3.453 530.847   4.000
library(qpcR)  # for PRESS statistic
## Loading required package: minpack.lm
## Loading required package: rgl
## Loading required package: robustbase

# calculate the MSPR (mean squared prediction error)
msrp = function(actuals, predicted) {
    sum((actuals - predicted)^2)/length(actuals)
}

# print out linear model info
model_info = function(model) {
    print(summary(model))
    # SSE
    SSE = deviance(model)
    print(paste("SSE:", SSE))
    # PRESS
    pr = PRESS(model, verbose = FALSE)
    print(paste("PRESS:", pr$stat))
    # MSE
    MSE = tail(anova(model)$Mean, 1)
    print(paste("MSE:", MSE))
    # R2a
    aR2 = summary(model)$adj.r.squared
    print(paste("Adjusted R^2:", aR2))

}

Model 1: Linear Model with age, num_following, and num_of_lists

model_1_training = lm(num_of_followers ~ age + num_following + num_of_lists, 
    training_data)
model_info(model_1_training)
## 
## Call:
## lm(formula = num_of_followers ~ age + num_following + num_of_lists, 
##     data = training_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -39786   -923   -132    791 135797 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1061.216    790.541    1.34    0.180    
## age             -1.493      0.674   -2.22    0.027 *  
## num_following    0.512      0.251    2.04    0.042 *  
## num_of_lists    27.448      1.105   24.84   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8360 on 593 degrees of freedom
## Multiple R-squared:  0.551,  Adjusted R-squared:  0.548 
## F-statistic:  242 on 3 and 593 DF,  p-value: <2e-16
## 
## [1] "SSE: 41449021988.4906"
## [1] "PRESS: 47302849220.7327"
## [1] "MSE: 69897170.300996"
## [1] "Adjusted R^2: 0.548314941651526"
print("Cp: 4.0")
## [1] "Cp: 4.0"
# check how closely the model will predict the values in the validation set
predicted_vals = predict(model_1_training, newdata = validation_data)
MSRP = msrp(validation_data$num_of_followers, predicted_vals)
print(paste("MSPR:", MSRP))
## [1] "MSPR: 115198912.205575"

# this is for the validation data
model_1_validation = lm(num_of_followers ~ age + num_following + num_of_lists, 
    validation_data)
model_info(model_1_validation)
## 
## Call:
## lm(formula = num_of_followers ~ age + num_following + num_of_lists, 
##     data = validation_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -63446  -1246   -180    794 131880 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    528.814   1203.679    0.44    0.661    
## age             -1.465      0.962   -1.52    0.129    
## num_following    1.478      0.582    2.54    0.011 *  
## num_of_lists    29.437      1.269   23.20   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10700 on 396 degrees of freedom
## Multiple R-squared:  0.588,  Adjusted R-squared:  0.585 
## F-statistic:  189 on 3 and 396 DF,  p-value: <2e-16
## 
## [1] "SSE: 45335987848.9446"
## [1] "PRESS: 67870852076.9708"
## [1] "MSE: 114484817.800365"
## [1] "Adjusted R^2: 0.585175647988523"
print("Cp: 4.0")
## [1] "Cp: 4.0"

Model 2: Linear Model with num_following and num_of_lists

model_2_training = lm(num_of_followers ~ num_following + num_of_lists, training_data)
model_info(model_2_training)
## 
## Call:
## lm(formula = num_of_followers ~ num_following + num_of_lists, 
##     data = training_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -38134   -526    201    489 137223 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -431.19     415.46   -1.04    0.300    
## num_following     0.44       0.25    1.76    0.079 .  
## num_of_lists     26.54       1.03   25.79   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8390 on 594 degrees of freedom
## Multiple R-squared:  0.547,  Adjusted R-squared:  0.545 
## F-statistic:  358 on 2 and 594 DF,  p-value: <2e-16
## 
## [1] "SSE: 41792324744.3521"
## [1] "PRESS: 47273475630.8647"
## [1] "MSE: 70357449.0645658"
## [1] "Adjusted R^2: 0.545340557434191"
print("Cp: 7.13")
## [1] "Cp: 7.13"
# check how closely the model will predict the values in the validation set
predicted_vals = predict(model_2_training, newdata = validation_data)
MSRP = msrp(validation_data$num_of_followers, predicted_vals)
print(paste("MSPR:", MSRP))
## [1] "MSPR: 116078074.89749"

# this is for the validation data
model_2_validation = lm(num_of_followers ~ num_following + num_of_lists, validation_data)
model_info(model_2_validation)
## 
## Call:
## lm(formula = num_of_followers ~ num_following + num_of_lists, 
##     data = validation_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -61565  -1239    167    742 132339 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -980.989    682.940   -1.44    0.152    
## num_following    1.320      0.573    2.30    0.022 *  
## num_of_lists    28.996      1.237   23.43   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10700 on 397 degrees of freedom
## Multiple R-squared:  0.586,  Adjusted R-squared:  0.584 
## F-statistic:  281 on 2 and 397 DF,  p-value: <2e-16
## 
## [1] "SSE: 45601206453.8601"
## [1] "PRESS: 66178476322.9373"
## [1] "MSE: 114864499.883779"
## [1] "Adjusted R^2: 0.583799907717905"
print("Cp: 3.45")
## [1] "Cp: 3.45"

Model 3: Linear Model with just num_of_lists

model_3_training = lm(num_of_followers ~ num_of_lists, training_data)
model_info(model_3_training)
## 
## Call:
## lm(formula = num_of_followers ~ num_of_lists, data = training_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -39446   -610     90    450 136120 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -113.13     374.73    -0.3     0.76    
## num_of_lists    26.91       1.01    26.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8400 on 595 degrees of freedom
## Multiple R-squared:  0.545,  Adjusted R-squared:  0.544 
## F-statistic:  711 on 1 and 595 DF,  p-value: <2e-16
## 
## [1] "SSE: 42010118655.8661"
## [1] "PRESS: 46494898615.1528"
## [1] "MSE: 70605241.4384305"
## [1] "Adjusted R^2: 0.543739289280339"
print("Cp: 5.74")
## [1] "Cp: 5.74"
# check how closely the model will predict the values in the validation set
predicted_vals = predict(model_3_training, newdata = validation_data)
MSRP = msrp(validation_data$num_of_followers, predicted_vals)
print(paste("MSPR:", MSRP))
## [1] "MSPR: 116581326.297621"

# this is for the validation data
model_3_validation = lm(num_of_followers ~ num_of_lists, validation_data)
model_info(model_3_validation)
## 
## Call:
## lm(formula = num_of_followers ~ num_of_lists, data = validation_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -63385   -764     27    236 131203 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -95.97     567.42   -0.17     0.87    
## num_of_lists    29.15       1.24   23.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10800 on 398 degrees of freedom
## Multiple R-squared:  0.58,   Adjusted R-squared:  0.579 
## F-statistic:  550 on 1 and 398 DF,  p-value: <2e-16
## 
## [1] "SSE: 46209609925.9561"
## [1] "PRESS: 64508977644.9374"
## [1] "MSE: 116104547.552653"
## [1] "Adjusted R^2: 0.579306718310019"
print("Cp: 1.85")
## [1] "Cp: 1.85"

Model 4: Linear Model with age and num_of_lists

model_4_training = lm(num_of_followers ~ age + num_of_lists, training_data)
model_info(model_4_training)
## 
## Call:
## lm(formula = num_of_followers ~ age + num_of_lists, data = training_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -41091   -923   -182    761 134704 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1247.39     787.32    1.58     0.11    
## age             -1.31       0.67   -1.96     0.05 .  
## num_of_lists    27.76       1.10   25.31   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8380 on 594 degrees of freedom
## Multiple R-squared:  0.547,  Adjusted R-squared:  0.546 
## F-statistic:  359 on 2 and 594 DF,  p-value: <2e-16
## 
## [1] "SSE: 41739217048.0925"
## [1] "PRESS: 46702603408.74"
## [1] "MSE: 70268042.1685058"
## [1] "Adjusted R^2: 0.545918317004282"
print("Cp: 3.883")
## [1] "Cp: 3.883"
# check how closely the model will predict the values in the validation set
predicted_vals = predict(model_4_training, newdata = validation_data)
MSRP = msrp(validation_data$num_of_followers, predicted_vals)
print(paste("MSPR:", MSRP))
## [1] "MSPR: 115952370.585762"

# this is for the validation data
model_4_validation = lm(num_of_followers ~ age + num_of_lists, validation_data)
model_info(model_4_validation)
## 
## Call:
## lm(formula = num_of_followers ~ age + num_of_lists, data = validation_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -64858   -922   -245    458 130785 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1038.220   1194.976    0.87     0.39    
## age            -1.028      0.953   -1.08     0.28    
## num_of_lists   29.469      1.278   23.07   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10800 on 397 degrees of freedom
## Multiple R-squared:  0.582,  Adjusted R-squared:  0.579 
## F-statistic:  276 on 2 and 397 DF,  p-value: <2e-16
## 
## [1] "SSE: 46074639492.9762"
## [1] "PRESS: 66404346921.3381"
## [1] "MSE: 116057026.43067"
## [1] "Adjusted R^2: 0.579478906369643"
print("Cp: 2.66")
## [1] "Cp: 2.66"

Model Transformation

Model 5: Linear Model with Log(num_of_followers) and age, num_following, and num_of_lists

Before running the next model, the Box-Cox method was used to determine whether any transformation should be applied to the response variable (num_of_followers). Box-Cox returns \( \lambda = 0.06 \), which is quite close to 0, so a log transform of the response was applied.
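
For context, the Box-Cox family of power transformations is defined as follows, which is why a \( \lambda \) near 0 corresponds to the log transform:

\[ y^{(\lambda)} = \begin{cases} \frac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \log y, & \lambda = 0 \end{cases} \]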

# run the Box-Cox procedure (boxcox comes from MASS, loaded earlier)
model = lm(num_of_followers ~ age + num_following + num_of_lists, data = training_data)
bc = boxcox(model, xlab = expression(lambda), ylab = "log-Likelihood")

[Figure: Box-Cox log-likelihood profile; the maximum occurs near lambda = 0.06]

max = with(bc, x[which.max(y)])  # lambda that maximizes the log-likelihood: 0.06

# create columns for the log-transformed response
training_data$log_num_of_followers = log(training_data$num_of_followers)
validation_data$log_num_of_followers = log(validation_data$num_of_followers)

# the fit and evaluation of this model were omitted from the original
# analysis; a sketch is given below

Model 6: Linear Model with num_of_followers^0.06 and age, num_following, and num_of_lists

Also based on the Box-Cox result, num_of_followers was raised to the 0.06 power.

# create columns for the transformed response, Y^0.06
training_data$raise_num_of_followers = training_data$num_of_followers^0.06
validation_data$raise_num_of_followers = validation_data$num_of_followers^0.06

# the fit and evaluation of this model were likewise omitted from the
# original analysis; a sketch is given below
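
Similarly, a sketch for Model 6, assuming predictions are back-transformed by raising them to the power 1/0.06:

raise_model = lm(raise_num_of_followers ~ age + num_following + num_of_lists, 
    data = training_data)
# back-transform: invert y^0.06 by raising to 1/0.06
predicted_vals = predict(raise_model, newdata = validation_data)^(1/0.06)
print(paste("MSPR:", msrp(validation_data$num_of_followers, predicted_vals)))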

Initial Conclusions

Of the initial candidate models, Model 3 (the linear model using just num_of_lists) has the lowest \( C_p \) and PRESS when the models are refit on the validation set, while Model 1 retains the lowest MSPR. However, a few other modeling approaches can also be applied.

Model 7: Robust Linear Regression with age, num_following, and num_of_lists

robust_model_7 = rlm(num_of_followers ~ age + num_following + num_of_lists, 
    data = training_data, psi = psi.bisquare, init = "lts", maxit = 50)
summary(robust_model_7)
## 
## Call: rlm(formula = num_of_followers ~ age + num_following + num_of_lists, 
##     data = training_data, psi = psi.bisquare, init = "lts", maxit = 50)
## Residuals:
##      Min       1Q   Median       3Q      Max 
##  -2706.0   -203.2     11.3    456.6 168230.2 
## 
## Coefficients:
##               Value   Std. Error t value
## (Intercept)    36.278  43.233      0.839
## age             0.067   0.037      1.816
## num_following   0.534   0.014     38.907
## num_of_lists   12.106   0.060    200.327
## 
## Residual standard error: 434 on 593 degrees of freedom
predicted_vals = predict(robust_model_7, newdata = validation_data)
m7_error = sum(abs(predicted_vals - validation_data$num_of_followers))/length(predicted_vals)
print(paste("The average prediction error is:", m7_error))
## [1] "The average prediction error is: 2197.77911362574"

Model 8: Robust Linear Regression with num_following and num_of_lists

robust_model_8 = rlm(num_of_followers ~ num_following + num_of_lists, data = training_data, 
    psi = psi.bisquare, init = "lts", maxit = 50)
summary(robust_model_8)
## 
## Call: rlm(formula = num_of_followers ~ num_following + num_of_lists, 
##     data = training_data, psi = psi.bisquare, init = "lts", maxit = 50)
## Residuals:
##       Min        1Q    Median        3Q       Max 
##  -2689.82   -202.26      4.45    451.15 168132.63 
## 
## Coefficients:
##               Value   Std. Error t value
## (Intercept)    98.632  22.759      4.334
## num_following   0.539   0.014     39.358
## num_of_lists   12.165   0.056    215.802
## 
## Residual standard error: 435 on 594 degrees of freedom
predicted_vals = predict(robust_model_8, newdata = validation_data)
m8_error = sum(abs(predicted_vals - validation_data$num_of_followers))/length(predicted_vals)
print(paste("The average prediction error is:", m8_error))
## [1] "The average prediction error is: 2194.31317019601"

Model 9: Robust Linear Regression with num_following and num_of_lists (refit)

Robust regression is less sensitive to outliers than ordinary least squares regression; the bisquare psi function downweights observations with large residuals. Note that this model repeats the specification of Model 8: the coefficients differ because the "lts" initialization draws random subsamples, so repeated fits can converge to slightly different solutions.

robust_model_9 = rlm(num_of_followers ~ num_following + num_of_lists, data = training_data, 
    psi = psi.bisquare, init = "lts", maxit = 50)
summary(robust_model_9)
## 
## Call: rlm(formula = num_of_followers ~ num_following + num_of_lists, 
##     data = training_data, psi = psi.bisquare, init = "lts", maxit = 50)
## Residuals:
##       Min        1Q    Median        3Q       Max 
##  -2225.13   -201.95     -9.38    530.47 169970.20 
## 
## Coefficients:
##               Value   Std. Error t value
## (Intercept)   164.874  22.984      7.173
## num_following   0.476   0.014     34.426
## num_of_lists   11.296   0.057    198.440
## 
## Residual standard error: 434 on 594 degrees of freedom
predicted_vals = predict(robust_model_9, newdata = validation_data)
m9_error = sum(abs(predicted_vals - validation_data$num_of_followers))/length(predicted_vals)
print(paste("The average prediction error is:", m9_error))
## [1] "The average prediction error is: 2266.81951363856"

Model 10: Decision Tree

library(tree)
regtree = tree(num_of_followers ~ age + num_of_tweets + num_following + num_of_favorites + 
    num_of_lists + as.factor(has_profile), data = training_data)
summary(regtree)
## 
## Regression tree:
## tree(formula = num_of_followers ~ age + num_of_tweets + num_following + 
##     num_of_favorites + num_of_lists + as.factor(has_profile), 
##     data = training_data)
## Variables actually used in tree construction:
## [1] "num_of_lists"
## Number of terminal nodes:  4 
## Residual mean deviance:  54600000 = 3.24e+10 / 593 
## Distribution of residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -47800    -804    -443       0     310  113000
predicted_vals = predict(regtree, newdata = validation_data)
m10_error = sum(abs(predicted_vals - validation_data$num_of_followers))/length(predicted_vals)
print(paste("The average prediction error is:", m10_error))
## [1] "The average prediction error is: 2456.18022240698"

Model 11: Quantile Regression
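
No quantile regression results appear in the report. Here is a minimal sketch of how a median (tau = 0.5) regression could be fit and scored like the models above, assuming the quantreg package is available; quant_model and m11_error are illustrative names:

library(quantreg)  # provides rq() for quantile regression
quant_model = rq(num_of_followers ~ age + num_following + num_of_lists, 
    tau = 0.5, data = training_data)
predicted_vals = predict(quant_model, newdata = validation_data)
m11_error = sum(abs(predicted_vals - validation_data$num_of_followers))/length(predicted_vals)
print(paste("The average prediction error is:", m11_error))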

Further Conclusion

Model 1 is chosen as the best model, so it is rebuilt with all of the data.


final_model = lm(num_of_followers ~ age + num_following + num_of_lists, data = full_data)
summary(final_model)
## 
## Call:
## lm(formula = num_of_followers ~ age + num_following + num_of_lists, 
##     data = full_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -54905  -1026   -172    779 134606 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    898.297    694.125    1.29   0.1959    
## age             -1.492      0.566   -2.63   0.0086 ** 
## num_following    0.837      0.187    4.48  8.5e-06 ***
## num_of_lists    28.057      0.691   40.58  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9680 on 996 degrees of freedom
## Multiple R-squared:  0.642,  Adjusted R-squared:  0.641 
## F-statistic:  596 on 3 and 996 DF,  p-value: <2e-16
confint(final_model)
##                   2.5 %    97.5 %
## (Intercept)   -463.8180 2260.4116
## age             -2.6033   -0.3803
## num_following    0.4701    1.2039
## num_of_lists    26.7003   29.4140