According to Alexa.com, Twitter.com is the 10th most popular site in the world. Twitter is a social network that allows users to share information as a string of 140 or less characters. This information is called a status update or tweet. Twitter also allows a user A to follow another user B. Then user A will be able to easily view all of user B's status updates. This interaction makes user A a follower of user B. The number of followers for a user can be seen as a status symbol or it can indicate a user's social media influence. This study attempts to predict the number of followers based upon the various characteristics of a twitter user. To be more exact, this study aims to predict the number of twitter followers for the top 1000 twitter accounts associated with the search term data.
Twitter has an API(Application Programming Interface) which provides access to information about the top 1000 users for any search term. Unfortunately, Twitter does not specify how these top users are determined, but the users can likely be identified as the most influential on twitter for a given search term. On October 10, 2013, the Twitter API was used to pull information about the top 1000 users associated with the term “data”. The final data is formatted as a CSV(Comma Separated Values) file with each row indicating a separate user and the columns as follows:
The data file was then split into 2 datasets. One for training the model and another for validating the model. The split was 60% for training and 40% for validation.
Obviously more can be added here. About the only thing to note here is the has_pic column contains only a single value that is different from the rest. Thus has_pic will not be included in the analysis.
library(stats)
# read in the files
set.seed(34567)
training_data = read.csv("twitter_user_data_data_training.csv")
validation_data = read.csv("twitter_user_data_data_test.csv")
full_data = read.csv("twitter_user_data_data.csv")
summary(training_data)
## handle name age
## AnnalynKurtz : 2 Big Data : 5 Min. : 5
## BigData_Journal: 2 Annalyn Kurtz : 2 1st Qu.: 650
## DataWzrd : 2 DATA : 2 Median :1153
## kdnuggets : 2 Economic Data : 2 Mean :1132
## neil_conway : 2 Gregory Piatetsky : 2 3rd Qu.:1582
## payscale : 2 Hitachi Data Systems: 2 Max. :2590
## (Other) :588 (Other) :585
## num_of_tweets has_profile has_pic num_following
## Min. : 2 Min. :0.000 Min. :0.000 Min. : 0
## 1st Qu.: 291 1st Qu.:1.000 1st Qu.:1.000 1st Qu.: 124
## Median : 865 Median :1.000 Median :1.000 Median : 426
## Mean : 4343 Mean :0.752 Mean :0.998 Mean : 906
## 3rd Qu.: 2652 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.: 1053
## Max. :203191 Max. :1.000 Max. :1.000 Max. :35422
##
## num_of_favorites num_of_lists num_of_followers
## Min. : 0 Min. : 0 Min. : 24
## 1st Qu.: 0 1st Qu.: 11 1st Qu.: 441
## Median : 4 Median : 34 Median : 977
## Mean : 456 Mean : 167 Mean : 4455
## 3rd Qu.: 42 3rd Qu.: 118 3rd Qu.: 2749
## Max. :70773 Max. :6412 Max. :211726
##
pairs(training_data[3:10])
When looking at the num_following versus the num_of_lists plot, there appear to be a few outliers. Thus, the user with the very high num_following and the users with the very high num_of_lists were removed. Therefore, the training set now contains 597 users instead of 600 users. Also, the analysis is not included here, but the models performed the same or better with the outliers removed.
plot(training_data$num_of_lists, training_data$num_following)
training_data = training_data[training_data$num_following < 20000 & training_data$num_of_lists <
5000, ]
dim(training_data) # new dimensions
## [1] 597 10
plot(training_data$num_of_lists, training_data$num_following)
First, a linear model with all the predictors was created. The full linear model identified the following predictors as significant: age, num_following, and num_of_lists. Then backwards step-wise regression was performed and the best fitting model was identified as the model containing the same predictors as the linear model previously mentioned.
The Box-Cox method was used to determine if any transformations needed to be performed on the response variable. As can be seen in the Box-Cox plot, the maximum value occurred at 0.06. Due to that value, two separate transformations were performed. The first transformation took the natural log of the dependent variable. The second transformation involved raising the dependent variable to the exponent, 0.06. Neither of these transformations yielded promising results, so the detailed analysis is not included in this report.
Next, all-subsets regression was performed using the leaps package in the R programming language. The leaps package will perform an exhaustive search of all possible subsets of the variables in order to find the best fitting models based upon the Mallows' Cp Criterion. As can be seen in the output plot, four models appeared to have low Cp values as compared to the rest. Not surprisingly, the four models contain different combinations of the 3 predictors already identified, including the model indentified by the step-wise regression. For these reasons and the reasons above, the four models with the lowest Cp values will be compared to determine the best model.
Here are the 4 candidate models being considered.
\[ num\_of\_followers = \beta_0 + \beta_1*age + \beta_2*num\_following + \beta_3*num\_of\_lists \]
\[ num\_of\_followers = \beta_0 + \beta_2*num\_following + \beta_3*num\_of\_lists \]
\[ num\_of\_followers = \beta_0 + \beta_3*num\_of\_lists \]
\[ num\_of\_followers = \beta_0 + \beta_1*age + \beta_3*num\_of\_lists \]
First look at the PRESS statistic. A PRESS statistic reasonably close to the SSE supports the validity of a linear regression model. As can be seen in the table, all four candidate models have a PRESS statistic reasonably close to the SSE.
Next look at the Mallows' \( C_p \) value. A lower \( C_p \) value is better, in particular, the \( C_p \) value should be less than p (number of predictors + 1 for the intercept). Also a \( C_p \) value equal to p indicates a model with no bias. Therefore, it is advantageous to find a \( C_p \) near p. Model 4 has the lowest overall \( C_p \) value, but for Model 4 the p is 3, making the \( C_p \) value greater than p. Only Model 1 has a \( C_p \) less than or equal to p. Model 1 has a \( C_p = 4 \) and \( p = 4 \). Thus, Mallows' \( C_p \) would favor Model 1.
Finally, look at the MSRP (Mean Squared Prediction Error) for the four models. A lower value indicates more predictive power. Model 1 has the lowest value of the four models, so MSRP favors Model 1 as well.
Overall, Model 1 appears to be the best predictive model for the twitter data. Model 1 was then recreated using all the available data, not just the training data. The final model for predicting the number of followers of a twitter user in the top 1000 for the search term 'data' is:
\[ num\_of\_followers = 898 - 1.5*age + .8*num\_following + 28.1*num\_of\_lists \]
Here is how the final model can be interpreted. Given a new account following 0 users and not it any lists, a twitter user in the top 1000 would be expected to have 898 followers. At first this seems ridulous. Why would a user have any followers without any activity and a brand new account. Remember, that this data is for twitter users in the top 1000, so for a new twitter account to appear in the top 1000, it is likely the person/organization that created the account is already influential outside of twitter. Think about a celebrity creating a twitter account. The account will quickly start attracting followers with the anticipation of future activity.
All other factors remaining the same, an increase in the age of the account by 1 day results in a decrease in the number of followers by 1.5. Thus having a twitter account for longer does not appear to increase followers. Also, when the other predictors remain the same, for every 1 user an account follows, it will result in .8 new followers. Another way to look at that is: keeping everything else the same, following 10 more people will result in 8 more followers.
With the rest of the predictors remaining the same, being included in 1 more list results in 28.1 more followers. Thus being in lists is the most influential predictor of followers. This makes sense considering the data is associated with the top 1000 and being in more lists means being more influential.
One interesting area of future research would be generalizing this model to work with other search terms. Does the same model still work well or do different terms need different models? How do search terms based upon trending topics have an affect?
It is possible to predict the number of followers for a twitter user in the top 1000 based upon the search term “data”. It appears the age of the account, the number of twitter users the account is following, and the number of lists including the account are all correlated with the number of followers for an account in the top 1000 twitter users for the search term “data”. For those that are familiar with twitter, it is not surprising that the number of tweets does not appear to be correlated with the number of followers. Thus, tweeting more is not helpful for getting more followers. Also, having a profile does not appear to be connected with the number of followers either.
Surprisingly, the quality of tweets does not appear to be correlated with the number of followers either. Number of tweets that have been favorited would be an indicator of the quality of the tweets. More favorites would appear to indicate more quality tweets. However, having more or less quality tweets does not appear to be correlated with the total number of followers.
basic_model = lm(num_of_followers ~ age + num_of_tweets + num_following + num_of_favorites +
num_of_lists + as.factor(has_profile), data = training_data)
summary(basic_model)
##
## Call:
## lm(formula = num_of_followers ~ age + num_of_tweets + num_following +
## num_of_favorites + num_of_lists + as.factor(has_profile),
## data = training_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40344 -941 -42 781 135434
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 572.9385 887.8686 0.65 0.519
## age -1.6756 0.7021 -2.39 0.017 *
## num_of_tweets -0.0142 0.0285 -0.50 0.618
## num_following 0.5167 0.2623 1.97 0.049 *
## num_of_favorites 0.0218 0.1024 0.21 0.832
## num_of_lists 27.6131 1.1464 24.09 <2e-16 ***
## as.factor(has_profile)1 955.6425 834.5350 1.15 0.253
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8370 on 590 degrees of freedom
## Multiple R-squared: 0.552, Adjusted R-squared: 0.547
## F-statistic: 121 on 6 and 590 DF, p-value: <2e-16
# Stepwise Regression
library(MASS)
step <- stepAIC(basic_model, direction = "backward") # forward, backward, or both
## Start: AIC=10792
## num_of_followers ~ age + num_of_tweets + num_following + num_of_favorites +
## num_of_lists + as.factor(has_profile)
##
## Df Sum of Sq RSS AIC
## - num_of_favorites 1 3.17e+06 4.13e+10 10790
## - num_of_tweets 1 1.74e+07 4.14e+10 10790
## - as.factor(has_profile) 1 9.19e+07 4.14e+10 10791
## <none> 4.13e+10 10792
## - num_following 1 2.72e+08 4.16e+10 10794
## - age 1 3.99e+08 4.17e+10 10795
## - num_of_lists 1 4.06e+10 8.20e+10 11198
##
## Step: AIC=10790
## num_of_followers ~ age + num_of_tweets + num_following + num_of_lists +
## as.factor(has_profile)
##
## Df Sum of Sq RSS AIC
## - num_of_tweets 1 1.55e+07 4.14e+10 10788
## - as.factor(has_profile) 1 9.49e+07 4.14e+10 10789
## <none> 4.13e+10 10790
## - num_following 1 2.70e+08 4.16e+10 10792
## - age 1 4.05e+08 4.17e+10 10794
## - num_of_lists 1 4.07e+10 8.20e+10 11197
##
## Step: AIC=10788
## num_of_followers ~ age + num_following + num_of_lists + as.factor(has_profile)
##
## Df Sum of Sq RSS AIC
## - as.factor(has_profile) 1 9.71e+07 4.14e+10 10787
## <none> 4.14e+10 10788
## - num_following 1 2.55e+08 4.16e+10 10790
## - age 1 4.19e+08 4.18e+10 10792
## - num_of_lists 1 4.32e+10 8.45e+10 11213
##
## Step: AIC=10787
## num_of_followers ~ age + num_following + num_of_lists
##
## Df Sum of Sq RSS AIC
## <none> 4.14e+10 10787
## - num_following 1 2.90e+08 4.17e+10 10789
## - age 1 3.43e+08 4.18e+10 10790
## - num_of_lists 1 4.31e+10 8.46e+10 11211
step$anova # display results
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## num_of_followers ~ age + num_of_tweets + num_following + num_of_favorites +
## num_of_lists + as.factor(has_profile)
##
## Final Model:
## num_of_followers ~ age + num_following + num_of_lists
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 590 4.133e+10 10792
## 2 - num_of_favorites 1 3172372 591 4.134e+10 10790
## 3 - num_of_tweets 1 15470094 592 4.135e+10 10788
## 4 - as.factor(has_profile) 1 97119266 593 4.145e+10 10787
# use this to find the Cp values
library(leaps)
# only check for columns that we are looking at
x = training_data[, c(3, 5, 9)]
y = training_data[, 10]
models = leaps(x, y)
models
## $which
## 1 2 3
## 1 FALSE FALSE TRUE
## 1 TRUE FALSE FALSE
## 1 FALSE TRUE FALSE
## 2 TRUE FALSE TRUE
## 2 FALSE TRUE TRUE
## 2 TRUE TRUE FALSE
## 3 TRUE TRUE TRUE
##
## $label
## [1] "(Intercept)" "1" "2" "3"
##
## $size
## [1] 2 2 2 3 3 3 4
##
## $Cp
## [1] 5.744 643.377 707.995 3.883 7.133 644.087 4.000
plot(models$size, models$Cp, log = "y", xlab = "# of predictors", ylab = expression(C[p]),
main = "Cp values by Number of Predictors", col = "red", cex = 1.5)
minimum <- models$Cp == min(models$Cp)
best.model <- models$which[minimum, ]
x_val = validation_data[, c(3, 5, 9)]
y_val = validation_data[, 10]
models_val = leaps(x_val, y_val)
models_val
## $which
## 1 2 3
## 1 FALSE FALSE TRUE
## 1 TRUE FALSE FALSE
## 1 FALSE TRUE FALSE
## 2 TRUE FALSE TRUE
## 2 FALSE TRUE TRUE
## 2 TRUE TRUE FALSE
## 3 TRUE TRUE TRUE
##
## $label
## [1] "(Intercept)" "1" "2" "3"
##
## $size
## [1] 2 2 2 3 3 3 4
##
## $Cp
## [1] 1.825 532.267 544.891 2.663 3.453 530.847 4.000
library(qpcR) # for PRESS statistic
## Loading required package: minpack.lm
## Loading required package: rgl
## Loading required package: robustbase
# calculate the MSRP
msrp = function(actuals, predicted) {
sum((actuals - predicted)^2)/length(actuals)
}
# print out linear model info
model_info = function(model) {
print(summary(model))
# SSE
SSE = deviance(model)
print(paste("SSE:", SSE))
# PRESS
pr = PRESS(model, verbose = FALSE)
print(paste("PRESS:", pr$stat))
# MSE
MSE = tail(anova(model)$Mean, 1)
print(paste("MSE:", MSE))
# R2a
aR2 = summary(model)$adj.r.squared
print(paste("Adjusted R^2:", aR2))
}
model_1_training = lm(num_of_followers ~ age + num_following + num_of_lists,
training_data)
model_info(model_1_training)
##
## Call:
## lm(formula = num_of_followers ~ age + num_following + num_of_lists,
## data = training_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39786 -923 -132 791 135797
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1061.216 790.541 1.34 0.180
## age -1.493 0.674 -2.22 0.027 *
## num_following 0.512 0.251 2.04 0.042 *
## num_of_lists 27.448 1.105 24.84 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8360 on 593 degrees of freedom
## Multiple R-squared: 0.551, Adjusted R-squared: 0.548
## F-statistic: 242 on 3 and 593 DF, p-value: <2e-16
##
## [1] "SSE: 41449021988.4906"
## [1] "PRESS: 47302849220.7327"
## [1] "MSE: 69897170.300996"
## [1] "Adjusted R^2: 0.548314941651526"
print("Cp: 4.0")
## [1] "Cp: 4.0"
# check how closely the model will predict the values in the validation set
predicted_vals = predict(model_1_training, newdata = validation_data)
MSRP = msrp(validation_data$num_of_followers, predicted_vals)
print(paste("MSPR:", MSRP))
## [1] "MSPR: 115198912.205575"
# this is for the validation data
model_1_validation = lm(num_of_followers ~ age + num_following + num_of_lists,
validation_data)
model_info(model_1_validation)
##
## Call:
## lm(formula = num_of_followers ~ age + num_following + num_of_lists,
## data = validation_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63446 -1246 -180 794 131880
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 528.814 1203.679 0.44 0.661
## age -1.465 0.962 -1.52 0.129
## num_following 1.478 0.582 2.54 0.011 *
## num_of_lists 29.437 1.269 23.20 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10700 on 396 degrees of freedom
## Multiple R-squared: 0.588, Adjusted R-squared: 0.585
## F-statistic: 189 on 3 and 396 DF, p-value: <2e-16
##
## [1] "SSE: 45335987848.9446"
## [1] "PRESS: 67870852076.9708"
## [1] "MSE: 114484817.800365"
## [1] "Adjusted R^2: 0.585175647988523"
print("Cp: 4.0")
## [1] "Cp: 4.0"
model_2_training = lm(num_of_followers ~ num_following + num_of_lists, training_data)
model_info(model_2_training)
##
## Call:
## lm(formula = num_of_followers ~ num_following + num_of_lists,
## data = training_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38134 -526 201 489 137223
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -431.19 415.46 -1.04 0.300
## num_following 0.44 0.25 1.76 0.079 .
## num_of_lists 26.54 1.03 25.79 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8390 on 594 degrees of freedom
## Multiple R-squared: 0.547, Adjusted R-squared: 0.545
## F-statistic: 358 on 2 and 594 DF, p-value: <2e-16
##
## [1] "SSE: 41792324744.3521"
## [1] "PRESS: 47273475630.8647"
## [1] "MSE: 70357449.0645658"
## [1] "Adjusted R^2: 0.545340557434191"
print("Cp: 7.13")
## [1] "Cp: 7.13"
# check how closely the model will predict the values in the validation set
predicted_vals = predict(model_2_training, newdata = validation_data)
MSRP = msrp(validation_data$num_of_followers, predicted_vals)
print(paste("MSPR:", MSRP))
## [1] "MSPR: 116078074.89749"
# this is for the validation data
model_2_validation = lm(num_of_followers ~ num_following + num_of_lists, validation_data)
model_info(model_2_validation)
##
## Call:
## lm(formula = num_of_followers ~ num_following + num_of_lists,
## data = validation_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -61565 -1239 167 742 132339
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -980.989 682.940 -1.44 0.152
## num_following 1.320 0.573 2.30 0.022 *
## num_of_lists 28.996 1.237 23.43 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10700 on 397 degrees of freedom
## Multiple R-squared: 0.586, Adjusted R-squared: 0.584
## F-statistic: 281 on 2 and 397 DF, p-value: <2e-16
##
## [1] "SSE: 45601206453.8601"
## [1] "PRESS: 66178476322.9373"
## [1] "MSE: 114864499.883779"
## [1] "Adjusted R^2: 0.583799907717905"
print("Cp: 3.45")
## [1] "Cp: 3.45"
model_3_training = lm(num_of_followers ~ num_of_lists, training_data)
model_info(model_3_training)
##
## Call:
## lm(formula = num_of_followers ~ num_of_lists, data = training_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39446 -610 90 450 136120
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -113.13 374.73 -0.3 0.76
## num_of_lists 26.91 1.01 26.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8400 on 595 degrees of freedom
## Multiple R-squared: 0.545, Adjusted R-squared: 0.544
## F-statistic: 711 on 1 and 595 DF, p-value: <2e-16
##
## [1] "SSE: 42010118655.8661"
## [1] "PRESS: 46494898615.1528"
## [1] "MSE: 70605241.4384305"
## [1] "Adjusted R^2: 0.543739289280339"
print("Cp: 5.74")
## [1] "Cp: 5.74"
# check how closely the model will predict the values in the validation set
predicted_vals = predict(model_3_training, newdata = validation_data)
MSRP = msrp(validation_data$num_of_followers, predicted_vals)
print(paste("MSPR:", MSRP))
## [1] "MSPR: 116581326.297621"
# this is for the validation data
model_3_validation = lm(num_of_followers ~ num_of_lists, validation_data)
model_info(model_3_validation)
##
## Call:
## lm(formula = num_of_followers ~ num_of_lists, data = validation_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63385 -764 27 236 131203
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -95.97 567.42 -0.17 0.87
## num_of_lists 29.15 1.24 23.46 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10800 on 398 degrees of freedom
## Multiple R-squared: 0.58, Adjusted R-squared: 0.579
## F-statistic: 550 on 1 and 398 DF, p-value: <2e-16
##
## [1] "SSE: 46209609925.9561"
## [1] "PRESS: 64508977644.9374"
## [1] "MSE: 116104547.552653"
## [1] "Adjusted R^2: 0.579306718310019"
print("Cp: 1.85")
## [1] "Cp: 1.85"
model_4_training = lm(num_of_followers ~ age + num_of_lists, training_data)
model_info(model_4_training)
##
## Call:
## lm(formula = num_of_followers ~ age + num_of_lists, data = training_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41091 -923 -182 761 134704
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1247.39 787.32 1.58 0.11
## age -1.31 0.67 -1.96 0.05 .
## num_of_lists 27.76 1.10 25.31 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8380 on 594 degrees of freedom
## Multiple R-squared: 0.547, Adjusted R-squared: 0.546
## F-statistic: 359 on 2 and 594 DF, p-value: <2e-16
##
## [1] "SSE: 41739217048.0925"
## [1] "PRESS: 46702603408.74"
## [1] "MSE: 70268042.1685058"
## [1] "Adjusted R^2: 0.545918317004282"
print("Cp: 3.883")
## [1] "Cp: 3.883"
# check how closely the model will predict the values in the validation set
predicted_vals = predict(model_4_training, newdata = validation_data)
MSRP = msrp(validation_data$num_of_followers, predicted_vals)
print(paste("MSPR:", MSRP))
## [1] "MSPR: 115952370.585762"
# this is for the validation data
model_4_validation = lm(num_of_followers ~ age + num_of_lists, validation_data)
model_info(model_4_validation)
##
## Call:
## lm(formula = num_of_followers ~ age + num_of_lists, data = validation_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64858 -922 -245 458 130785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1038.220 1194.976 0.87 0.39
## age -1.028 0.953 -1.08 0.28
## num_of_lists 29.469 1.278 23.07 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10800 on 397 degrees of freedom
## Multiple R-squared: 0.582, Adjusted R-squared: 0.579
## F-statistic: 276 on 2 and 397 DF, p-value: <2e-16
##
## [1] "SSE: 46074639492.9762"
## [1] "PRESS: 66404346921.3381"
## [1] "MSE: 116057026.43067"
## [1] "Adjusted R^2: 0.579478906369643"
print("Cp: 2.66")
## [1] "Cp: 2.66"
Before running the next model, the Box-Cox method was used to determine if any transformations need to be done on the response variable (num_of_followers). Box-Cox returns \( \lambda = 0.06 \) which is pretty close to 0, so a Log of the response was applied.
# run the box cox
model = lm(num_of_followers ~ age + num_following + num_of_lists, data = training_data)
bc = boxcox(model, xlab = expression(lambda), ylab = "log-Likelihood")
max = with(bc, x[which.max(y)])
# create a column for the transformed column
training_data$log_num_of_followers = log(training_data$num_of_followers)
validation_data$log_num_of_followers = log(validation_data$num_of_followers)
# run the model
# m4_error = model_test_function(log_num_of_followers ~ age + num_following
# + num_of_lists, 1)
Also due to the Box-Cox, the num_of_followers were raised to the 0.06 power.
# transform Y^.06 create a column for the transformed column
training_data$raise_num_of_followers = training_data$num_of_followers^0.06
validation_data$raise_num_of_followers = validation_data$num_of_followers^0.06
# m5_error = model_test_function(raise_num_of_followers ~ age +
# num_following + num_of_lists, 1)
Of the initial 5 models, the best predictive power on the validation set belongs to Model 3, the linear model using just the num_of_lists. However, a few other models can be applied.
robust_model_6 = rlm(num_of_followers ~ age + num_following + num_of_lists,
data = training_data, psi = psi.bisquare, init = "lts", maxit = 50)
summary(robust_model_6)
##
## Call: rlm(formula = num_of_followers ~ age + num_following + num_of_lists,
## data = training_data, psi = psi.bisquare, init = "lts", maxit = 50)
## Residuals:
## Min 1Q Median 3Q Max
## -2706.0 -203.2 11.3 456.6 168230.2
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 36.278 43.233 0.839
## age 0.067 0.037 1.816
## num_following 0.534 0.014 38.907
## num_of_lists 12.106 0.060 200.327
##
## Residual standard error: 434 on 593 degrees of freedom
predicted_vals = predict(robust_model_6, newdata = validation_data)
m6_error = sum(abs(predicted_vals - validation_data$num_of_followers))/length(predicted_vals)
print(paste("The average prediction error is:", m6_error))
## [1] "The average prediction error is: 2197.77911362574"
robust_model_7 = rlm(num_of_followers ~ num_following + num_of_lists, data = training_data,
psi = psi.bisquare, init = "lts", maxit = 50)
summary(robust_model_7)
##
## Call: rlm(formula = num_of_followers ~ num_following + num_of_lists,
## data = training_data, psi = psi.bisquare, init = "lts", maxit = 50)
## Residuals:
## Min 1Q Median 3Q Max
## -2689.82 -202.26 4.45 451.15 168132.63
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 98.632 22.759 4.334
## num_following 0.539 0.014 39.358
## num_of_lists 12.165 0.056 215.802
##
## Residual standard error: 435 on 594 degrees of freedom
predicted_vals = predict(robust_model_7, newdata = validation_data)
m7_error = sum(abs(predicted_vals - validation_data$num_of_followers))/length(predicted_vals)
print(paste("The average prediction error is:", m7_error))
## [1] "The average prediction error is: 2194.31317019601"
Robust Regression is less sensitive to outliers than ordinary least squares regression.
robust_model_8 = rlm(num_of_followers ~ num_following + num_of_lists, data = training_data,
psi = psi.bisquare, init = "lts", maxit = 50)
summary(robust_model_8)
##
## Call: rlm(formula = num_of_followers ~ num_following + num_of_lists,
## data = training_data, psi = psi.bisquare, init = "lts", maxit = 50)
## Residuals:
## Min 1Q Median 3Q Max
## -2225.13 -201.95 -9.38 530.47 169970.20
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 164.874 22.984 7.173
## num_following 0.476 0.014 34.426
## num_of_lists 11.296 0.057 198.440
##
## Residual standard error: 434 on 594 degrees of freedom
predicted_vals = predict(robust_model_8, newdata = validation_data)
m8_error = sum(abs(predicted_vals - validation_data$num_of_followers))/length(predicted_vals)
print(paste("The average prediction error is:", m8_error))
## [1] "The average prediction error is: 2266.81951363856"
library(tree)
regtree = tree(num_of_followers ~ age + num_of_tweets + num_following + num_of_favorites +
num_of_lists + as.factor(has_profile), data = training_data)
summary(regtree)
##
## Regression tree:
## tree(formula = num_of_followers ~ age + num_of_tweets + num_following +
## num_of_favorites + num_of_lists + as.factor(has_profile),
## data = training_data)
## Variables actually used in tree construction:
## [1] "num_of_lists"
## Number of terminal nodes: 4
## Residual mean deviance: 54600000 = 3.24e+10 / 593
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -47800 -804 -443 0 310 113000
predicted_vals = predict(regtree, newdata = validation_data)
m9_error = sum(abs(predicted_vals - validation_data$num_of_followers))/length(predicted_vals)
print(paste("The average prediction error is:", m9_error))
## [1] "The average prediction error is: 2456.18022240698"
Model 1 is chosen as the best model, so it can be rebuilt with all the data.
final_model = lm(num_of_followers ~ age + num_following + num_of_lists, data = full_data)
summary(final_model)
##
## Call:
## lm(formula = num_of_followers ~ age + num_following + num_of_lists,
## data = full_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54905 -1026 -172 779 134606
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 898.297 694.125 1.29 0.1959
## age -1.492 0.566 -2.63 0.0086 **
## num_following 0.837 0.187 4.48 8.5e-06 ***
## num_of_lists 28.057 0.691 40.58 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9680 on 996 degrees of freedom
## Multiple R-squared: 0.642, Adjusted R-squared: 0.641
## F-statistic: 596 on 3 and 996 DF, p-value: <2e-16
confint(final_model)
## 2.5 % 97.5 %
## (Intercept) -463.8180 2260.4116
## age -2.6033 -0.3803
## num_following 0.4701 1.2039
## num_of_lists 26.7003 29.4140