Metafilter Data Dump

This is an example of how to load and use the Metafilter Data Dump files in R.

First, we test whether there is a relationship between the length of a comment and the number of favourites it receives, for a given user.

user = 1490
# Load comment data file
comm = read.delim("commentdata_mefi.txt", skip = 1, stringsAsFactors = FALSE)
# Convert datestamp to POSIX format
comm$datestamp = as.POSIXct(comm$datestamp)
# Subset to a single userid
comms = subset(comm, userid == user)
# Load comment length data
len = read.delim("commentlength_mefi.txt", skip = 1, stringsAsFactors = FALSE)

# Merge the two dataframes based on commentid, keeping every comment in comms
otab = merge(comms, len, by = "commentid", all.x = TRUE, all.y = FALSE)
# Ignore 0-favourite comments
otabf = subset(otab, faves != 0)

# Linear regression on log-transformed data, favourites ~ length
mod = lm(log1p(otabf$faves) ~ log1p(otabf$length))
summary(mod)

## 
## Call:
## lm(formula = log1p(otabf$faves) ~ log1p(otabf$length))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.779 -0.638 -0.230  0.349  3.213 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.0739     0.1314    8.17  7.5e-16 ***
## log1p(otabf$length)   0.0476     0.0220    2.17    0.031 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.748 on 1221 degrees of freedom
## Multiple R-squared: 0.00383, Adjusted R-squared: 0.00301 
## F-statistic: 4.69 on 1 and 1221 DF,  p-value: 0.0306 
##
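
To put the fitted slope in context, the arithmetic below is a small sketch that uses only the estimates printed above (the variable names are illustrative). Because both variables are log1p-transformed, the slope acts on 1 + length and 1 + faves: doubling 1 + length multiplies the fitted 1 + faves by 2^slope.

```r
# Estimates copied from the summary(mod) output above
slope <- 0.0476       # coefficient on log1p(length)
r_squared <- 0.00383  # multiple R-squared

# Multiplicative change in fitted (1 + faves) when (1 + length) doubles
doubling_ratio <- 2^slope
round(doubling_ratio, 3)

# Share of variance in log1p(faves) explained by length, in percent
round(100 * r_squared, 2)
```

A doubling in length corresponds to roughly 3% more favourites on this scale, and length accounts for well under 1% of the variance, so the relationship is statistically detectable but very weak for this user.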

A scatterplot of log-transformed comment length vs. favourite count, with regression line.

plot(log1p(otabf$faves) ~ log1p(otabf$length), xlab = "log(1 + length)", ylab = "log(1 + favourites)")
abline(mod, col = 2, lwd = 2)


We can also produce a simple plot of the number of comments per month over the life of the account.

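# Choose about 100 regularly spaced breakpoints spanning the account's lifetime
h = pretty(otab$datestamp, 100)
# Assign each comment to the interval it falls in
otab$months = cut(otab$datestamp, h)

# Count the comments in each interval
sumry = aggregate(otab$commentid, by = list(otab$months), FUN = length)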
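
pretty() returns about 100 regularly spaced "round" breakpoints, so the bins above need not coincide exactly with calendar months. If exact months are wanted, cut() also accepts breaks = "month" for date-times. A minimal sketch on made-up timestamps (the toy data frame and its values are hypothetical, not taken from the dump):

```r
# Hypothetical timestamps: two in January, one in February, none in March, one in April
stamps <- as.POSIXct(c("2012-01-05 10:00:00", "2012-01-20 12:30:00",
                       "2012-02-03 09:15:00", "2012-04-11 18:45:00"),
                     tz = "UTC")
toy <- data.frame(commentid = 1:4, datestamp = stamps)

# One factor level per calendar month between the first and last timestamp
toy$month <- cut(toy$datestamp, breaks = "month")

# table() counts comments per level and keeps empty months as zeros
counts <- table(toy$month)
counts
```

Unlike table(), aggregate() omits groups that contain no rows, so a line plot built from its output joins straight across any empty month instead of dropping to zero.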

Plot of comments per month for given user:

plot(sumry$x ~ as.Date(sumry$Group.1), type = "l")


And what about looking at comment length over time?

sumry = aggregate(otab$length, by = list(otab$months), FUN = mean)
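
Because the earlier merge keeps every comment from comms, a comment with no row in the length file would carry an NA length, and mean() returns NA for any month containing one. Whether that happens in the real dump is not checked here. The sketch below uses hypothetical toy data to show the effect and one way around it, passing na.rm = TRUE through aggregate() to mean():

```r
# Hypothetical toy data: comment 3 has no entry in the length table
comments <- data.frame(commentid = 1:4,
                       month = factor(c("2012-01", "2012-01", "2012-02", "2012-02")))
lengths <- data.frame(commentid = c(1L, 2L, 4L), length = c(100, 300, 50))

# Left join: every comment is kept, unmatched ones get length NA
joined <- merge(comments, lengths, by = "commentid", all.x = TRUE)

# Plain mean: the month containing the NA comes out as NA
plain <- aggregate(joined$length, by = list(joined$month), FUN = mean)

# Extra arguments are forwarded to FUN, so missing lengths are dropped
clean <- aggregate(joined$length, by = list(joined$month), FUN = mean, na.rm = TRUE)
```

A quick sum(is.na(otab$length)) on the real data would show whether this precaution is needed at all.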

Plot of mean comment length per month for given user:

plot(sumry$x ~ as.Date(sumry$Group.1), type = "l")
