This is an example of how to load and use the Metafilter Data Dump files in R.
First, we test whether there is a relationship between the length of a comment and the number of favourites it receives, for a given user.
user = 1490
# Load comment data file
comm = read.delim("commentdata_mefi.txt", skip = 1, stringsAsFactors = FALSE)
# Convert datestamp to POSIX format
comm$datestamp = as.POSIXct(comm$datestamp)
# Subset to a single userid
comms = subset(comm, userid == user)
# Load comment length data
len = read.delim("commentlength_mefi.txt", skip = 1, stringsAsFactors = FALSE)
# Merge the two dataframes based on commentid
otab = merge(comms, len, by = "commentid", all.x = TRUE, all.y = FALSE)
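To make the effect of `all.x` and `all.y` concrete, here is a minimal sketch of the same kind of merge on two toy data frames. The names `comms_toy` and `len_toy` and all of their values are made up for illustration and are not taken from the data dump.

```r
# Toy stand-ins for the comment and length tables (all values are made up)
comms_toy = data.frame(commentid = c(1, 2, 3), faves = c(0, 4, 1))
len_toy = data.frame(commentid = c(1, 3, 4), length = c(120, 45, 300))

# Keep every row of comms_toy; drop rows of len_toy that have no match
joined = merge(comms_toy, len_toy, by = "commentid", all.x = TRUE, all.y = FALSE)
print(joined)

# Number of comments that ended up without a length record
n_missing = sum(is.na(joined$length))
print(n_missing)
```

Comment 2 has no entry in `len_toy`, so its `length` is `NA`; comment 4 appears only in `len_toy` and is dropped. Running the same `is.na()` check on `otab$length` is one way to see whether any of the user's comments lack a length record.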
# Ignore 0-favourite comments
otabf = subset(otab, faves != 0)
# Linear regression on log-transformed data, favourites ~ length
mod = lm(log1p(otabf$faves) ~ log1p(otabf$length))
summary(mod)
##
## Call:
## lm(formula = log1p(otabf$faves) ~ log1p(otabf$length))
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -0.779 -0.638 -0.230  0.349  3.213
##
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)           1.0739     0.1314    8.17  7.5e-16 ***
## log1p(otabf$length)   0.0476     0.0220    2.17    0.031 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.748 on 1221 degrees of freedom
## Multiple R-squared: 0.00383, Adjusted R-squared: 0.00301
## F-statistic: 4.69 on 1 and 1221 DF, p-value: 0.0306
##
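Because both variables are log-transformed, the slope is easier to read as a multiplicative effect. The sketch below is a rough interpretation rather than part of the original analysis. It takes the slope estimate from the summary above and works out what it implies when `1 + length` doubles.

```r
# Slope estimate copied from the regression summary (log-log scale)
slope = 0.0476

# Multiplicative change in (1 + favourites) when (1 + length) doubles
ratio = 2^slope

# The same quantity expressed as a percentage increase
pct = (ratio - 1) * 100
print(round(pct, 1))
```

So a comment twice as long is associated with roughly 3% more favourites (on the `1 + faves` scale) for this user. With a multiple R-squared of 0.00383, length explains well under 1% of the variance, so the relationship is statistically detectable but very weak.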
A scatterplot of log-transformed comment length vs. favourite count, with regression line.
plot(log1p(otabf$faves) ~ log1p(otabf$length), xlab = "log(1 + length)",
    ylab = "log(1 + favourites)")
abline(mod, col = 2, lwd = 2)
We can also produce a simple plot of number of comments per month over the life of the account.
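# Bin timestamps by calendar month; pretty(x, 100) only gives roughly 100
# evenly spaced breaks, which are not guaranteed to fall on month boundaries
otab$months = cut(otab$datestamp, breaks = "month")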
sumry = aggregate(otab$commentid, by = list(otab$months), FUN = length)
Plot of comments per month for given user:
plot(sumry$x ~ as.Date(sumry$Group.1), type = "l")
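One detail worth knowing about this kind of plot: `aggregate()` only returns groups that actually occur in the data, so a month with no comments is left out entirely and the line simply runs straight across the gap. The sketch below shows one possible way to keep empty months as zeros, using `cut()` with `breaks = "month"` and `table()`. The data frame `toy` and its timestamps are made up for illustration.

```r
# Synthetic timestamps (made up); note that there are none in March 2010
toy = data.frame(
  commentid = 1:6,
  datestamp = as.POSIXct(c("2010-01-05 10:00:00", "2010-01-20 12:30:00",
                           "2010-02-03 09:15:00", "2010-04-11 18:45:00",
                           "2010-04-12 08:00:00", "2010-04-30 23:00:00"),
                         tz = "UTC")
)

# Bin by calendar month; levels are labelled with the first day of each month
toy$month = cut(toy$datestamp, breaks = "month")

# table() counts every factor level, including months with no comments
counts = as.data.frame(table(month = toy$month))
counts$month = as.Date(as.character(counts$month))
print(counts)

# Same style of line plot as above, now with zero-count months included
plot(counts$Freq ~ counts$month, type = "l", xlab = "Month", ylab = "Comments")
```

The same approach could be applied to `otab$months` if the gaps matter for the account being examined.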
And what about looking at comment length over time?
sumry = aggregate(otab$length, by = list(otab$months), FUN = mean)
Plot of mean comment length per month for given user:
plot(sumry$x ~ as.Date(sumry$Group.1), type = "l")