Once upon a time, Sina’s Weibo was deemed the savior of social activism and free speech in China. It is a plaza where users, wearing the mask of internet anonymity, get to do things authentic to their emotions (rather than adhering to complicated Chinese social norms): vent, express and create. The quality of the user experience on Weibo, however, sometimes comes into question due to the presence of (1) censorship, (2) plagiarized, faked and repetitive content, and (3) too much sponsored content. Item (3) is the motivation for this study.
While there are still signs that Western companies are interested in Weibo marketing, vicious competition and a lack of solid ROI have marred Weibo marketing as an every-party-loses practice in decline (or at least in need of a major cleanup). Trending topics and hot-search lists have long been invaded by sponsored content. Aside from the glaringly dominant marketing associated with the mainland entertainment industry, popular accounts that share news and compose original comedic content, known as duanzishou, are themselves implicit marketing accounts (yingxiao zhanghao), targeting their large followings (in the millions) and beyond. The number of reposts (shares) of a marketing post and the number of followers of such an account indicate reach and engagement. These can constitute the KPIs of a Weibo marketing middleman company that takes jobs from businesses and organizes duanzishou. The result is the phenomenon of “zombie fans” (jiangshifen) or the “navy” (shuijun) - fake follower accounts with no activity other than sharing and commenting on marketing posts, presumably for pay.
User backlash has gone hand in hand with this engineered, nakedly commercial phenomenon. Interestingly, ad posts and the corresponding patterns of user protest come in two flavors. “Hard” ads posted directly under a merchant’s or marketing event’s official account tend to be dominated by paid “navy” comments that artificially popularize these marketing posts; real users largely ignore both the posts and the comments. On the other hand, sophisticated, subtle native ad placements by individual duanzishou, especially those who started out authentic, meet with protests from users. The protests take the form of an indicative vocabulary in the comments. Users crowd-flag plagiarized, faked or sponsored content by voting to the top comments that repeatedly contain indicative words such as “copycat” (chaoxi) and “advertisement” (guanggao). It’s as if a mob has converged on a favorite insult to hurl at the offender.
Based on these observations, this study suggests the following:
The occurrence of indicative vocabulary in comment texts is potentially a strong feature for predicting whether a Weibo post is a piece of (native) advertisement. The more indicative words, the more likely the post is a “Type 2” native ad.
If the number of reposts is an indicator of promotional effort and the number of comments is an indicator of real user engagement, then a comparison between the two could indicate ad content. The higher the repost-to-comment ratio, the more likely the post is a “Type 1” or “Type 2” ad.
Comments that are replies may indicate a higher degree of user engagement, and may correspond to a lower use of indicative words.
The use of emoticons and topic-mentions (hashtags, or rather double-hashtags on Weibo) potentially indicates “navy” comments, which in turn indicate ad content. The more emoticons and/or topic-mentions, the more likely the post is a “Type 1” hard ad.
Short comments, in conjunction with emoticons and topic-mentions, could indicate a “navy” that didn’t want to make too much of an effort. The shorter the comments, the more likely the post is a “Type 1” hard ad. Note, however, that comment length depends on a variety of factors (post content, degree of controversy, etc.), so this is likely a weak feature for classifying ad posts.
(Positive sentiment is also notably associated with “navy” comments, a topic I have not yet investigated because of my limited familiarity with tools for Chinese machine translation and sentiment analysis. It is a potential topic for a future study.)
This study takes some of the features mentioned above and investigates whether any patterns exist. To do so I analyzed 844 posts created by popular users (those with a large number of followers) on the morning of 29th October 2015. This blog post describes the data acquisition through the Weibo APIs, the feature extraction, and an initial exploratory (clustering) analysis.
(The next phase of the analysis aims at building a classifier. To do this I’ll spend substantial time manually labeling each post. I hope to present the results soon in a separate blog post.)
Wrangling Weibo data out of a less-than-cooperative API is the first hurdle. The most up-to-date API documentation can be found here. I also found these instructions for setting up the API very useful (on top of which I composed a step-by-step how-to guide on accessing the Weibo APIs).
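As a bare-bones illustration, a single call to the hot-user-posts endpoint (the one used throughout the rest of this post) looks like this; the credentials below are placeholders:
library(httr)
library(jsonlite)
# One authenticated call to the hot-user-posts endpoint (placeholder credentials);
# simplifyVector = FALSE keeps the nested-list structure the rest of this post works with.
resp <- GET(paste0("https://api.weibo.com/2/suggestions/users/hot.json",
                   "?client_id=", "YOUR_APP_KEY",
                   "&access_token=", "YOUR_ACCESS_TOKEN",
                   "&category=", "ent"))
hot_ent <- fromJSON(rawToChar(resp$content), simplifyVector = FALSE)
length(hot_ent)  # number of hot users returned for the "ent" (entertainment) category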
There are substantial limitations to getting data through the Weibo APIs for a hobbyist user like me:
1. Some data simply can’t be obtained through the APIs. For example, a stronger feature would be the presence of indicative vocabulary in the top-voted / hot comments (e.g. the first 20) rather than in all comments. Unfortunately, there is no API for accessing the votes on individual comments. One could potentially crawl the web pages and parse the unstructured text for this purpose.
2. Each user is limited to 150 API calls an hour (in practice it seemed to be even fewer, for reasons I don’t know). Working on this part-time, it took days to get all the data I needed.
3. It’s tough to get real-time data. Weibo administrators and/or users limit and censor comments; over time, comments or entire posts can disappear, and some posts disallow comments in the first place. Since comment texts were obtained over a period of days (because of the 150-calls-an-hour limit), the distribution of the number of comments changed along the way.
Notwithstanding the limitations, here’s what I did:
A simple function calls the “hot user posts” API and parses the JSON text it returns. In this case, I got 1084 posts from the morning of 29th October 2015 (I failed to note the exact time). Weibo hot users’ posts are categorized into 21 topics; conceivably, categories such as “entertainment” return far more posts than categories such as “government”. The data obtained is structured accordingly: data[[i]][[j]] is the jth post from the ith category.
##### TASK 1: Get raw data #####
### Set the stage
library("jsonlite")
library(rjson)
library("XML")
library(httr)
options(scipen = 999) # To prevent scientific notation on large integers (Post IDs)
Sys.setlocale(category = "LC_ALL", locale = "chs") # To read Chinese characters (UTF-8 encoding) correctly
Source <- "3182990075"
Access_token <- "2.00orEdqCbHV6TDfe7c3ed211P_oIRB"
setwd("/Users/yingjiang/Dropbox/learnings/Stats_data/Projects/Weibo/Advwords")
### Function that calls the hot-user-posts API at a particular point in time and parses the JSON text into lists.
# Name: gethupall()
# Takes in: Weibo developer user ID ("source"), user access token, date, and the user's preference for keeping the 21 categories as separate lists or combining them into one.
# Returns: A list of Weibo posts from the 21 categories, either as 21 separate lists or as a single combined list.
gethupall <- function(source = "3182990075",
                      access_token = "2.00orEdqCbHV6TDfe7c3ed211P_oIRB",
                      date,           # Enter in the format of YYYYMMDD
                      seplist = TRUE) {
  cat <- c("default", "ent", "music", "sports", "fashion",
           "art", "cartoon", "games", "trip", "food",
           "health", "literature", "stock", "business", "tech",
           "house", "auto", "fate", "govern", "medium",
           "marketer")
  url <- character()
  url.data <- list()
  hupdata <- list()
  for(i in 1:length(cat)) {
    url <- c(url,
             paste("https://api.weibo.com/2/suggestions/users/hot.json?client_id=",
                   source,
                   "&access_token=",
                   access_token,
                   "&category=",
                   cat[i],
                   sep = ''))
  }
  for(i in 1:length(cat)) {
    url.data[[i]] <- GET(url[i])
    hupdata[[i]] <- fromJSON(rawToChar(url.data[[i]]$content))
    write(toJSON(fromJSON(rawToChar(url.data[[i]]$content))),
          paste("/Users/yingjiang/Dropbox/Learnings/Stats_data/Projects/Weibo/Advwords/hupdata_", date, "_Cat", i, ".json", sep=''))
  }
  # Gets user's preferences for separate or combined lists of topics.
  if(seplist == T) {
    names(hupdata) <- cat
    return(hupdata)
  }
  if(seplist == F) {
    hupdata2 <- list()
    for(i in 1:length(hupdata)) {
      hupdata2 <- c(hupdata2, hupdata[[i]])
    }
    names(hupdata2) <- cat
    return(hupdata2)
  }
}
### Call the function to get all hot users' posts as a list.
hup_20151009 <- gethupall(source = Source,
                          access_token = Access_token,
                          date = "20151009")
From the raw data, we extract some basic, potentially useful information about each post:
- user ID
- post ID
- length of post
- number of reposts
- number of comments
- number of likes.
Operationally, this is where the API-call limit started to frustrate: getting the data had to happen in stages, and rather manually. Not being an advanced coder, I simply restarted the function call from the ith category and jth post at which the API quota had run out. Finally, a dataframe with the above features is written to file.
##### TASK 2: Get basic post info #####
### Function that gets basic post info
# Name: gethupinfo()
# Takes in: Weibo developer user ID, access token, hot-user-post data in the format of a list.
# Returns: a one-row dataframe with user ID, post ID, post length, and the numbers of comments, reposts and likes.
gethupinfo <- function(hupdata,
                       source = "3182990075",
                       access_token = "2.00orEdqCbHV6TDfe7c3ed211P_oIRB") {
  ## Get user id
  userid <- hupdata$id
  ## Get post id
  # postid <- "3840958959093449"
  postid <- hupdata$status$id
  ## Get length of post
  postlength <- nchar(hupdata$status$text)
  ## Get total number of reposts and comments
  repostdata <- GET(paste("https://api.weibo.com/2/statuses/count.json?source=",
                          source,
                          "&access_token=",
                          access_token,
                          "&ids=",
                          postid,
                          sep = ""))
  repostdata2 <- fromJSON(rawToChar(repostdata$content))
  ncomments <- repostdata2[[1]]$comments # 1304
  nreposts <- repostdata2[[1]]$reposts # 1343
  nlikes <- repostdata2[[1]]$attitudes # 7977
  return(data.frame(Userid = userid,
                    Postid = postid,
                    Post.length = postlength,
                    No.Comments = ncomments,
                    No.reposts = nreposts,
                    No.Likes = nlikes))
}
### Call gethupinfo()
# Make a temp dataframe for storing the results in patches.
hupinfo_20151009 <- data.frame()
a <- data.frame()
# Resume category 17 from post 87, where the previous hour's API quota ran out.
for(j in 87:length(hup_20151009[[17]])) {
  print(j)
  if ("status" %in% names(hup_20151009[[17]][[j]])) {
    a <- rbind(a, gethupinfo(hup_20151009[[17]][[j]]))
  }
}
hupinfo_20151009 <- rbind(hupinfo_20151009, a)
# Continue with the remaining i's (categories 18 onwards); attach the results into a single dataframe.
a <- data.frame()
for(i in 18:length(hup_20151009)) {
  for(j in 1:length(hup_20151009[[i]])) {
    print(i)
    print(j)
    if ("status" %in% names(hup_20151009[[i]][[j]])) {
      a <- rbind(a, gethupinfo(hup_20151009[[i]][[j]]))
    }
  }
}
hupinfo_20151009 <- rbind(hupinfo_20151009, a)
# Write the final table to file. This concludes the basic post info for one session of hot-post acquisition.
write.table(hupinfo_20151009,
            file = "C:/Users/jiangy/Dropbox/learnings/Stats_data/Projects/Weibo/Advwords/hupinfo_20151009.txt",
            sep = "\t",
            row.names = F)
Extracting comments is slow: a post with a large number of comments (above 1,000) takes several API calls. Posts that no longer exist (deleted) would stop the execution, which I then restarted manually. The longer the time lapse, the more rampant the post deletion. The number of comments, too, visibly shrinks with time, potentially due to system- or owner-initiated comment censorship and deletion. This unfortunately keeps us from getting a full picture of user reactions. Of the original 1084 posts I managed to save 844, whose comments are written to file.
##### TASK 3: Get comment texts #####
### Function to get comment texts:
# Name: getcomments2()
# Takes in: Weibo developer user ID, access token, post ID, post's no. comments
# Returns: a character vector of cmttext
getcomments2 <- function(source = "3182990075",
                         access_token = "2.00orEdqCbHV6TDfe7c3ed211P_oIRB",
                         postid,
                         ncomments) {
  # Get the comment JSON data from the API, at 200 comments a page.
  cmtpg <- ncomments %/% 200 + 1
  cmtdata <- list()
  for(i in 1:cmtpg) {
    urltemp <- paste("https://api.weibo.com/2/comments/show.json?source=",
                     source,
                     "&access_token=",
                     access_token,
                     "&id=",
                     postid,
                     "&count=200&page=",
                     i,
                     sep = '')
    url.data <- GET(urltemp)
    cmtdata[[i]] <- fromJSON(rawToChar(url.data$content))
  }
  # Combine all comments into a single list.
  # Each list element is a single comment (a list of fields returned by the API).
  cmtdataall <- list()
  for(i in 1:length(cmtdata)) {
    cmtdataall <- c(cmtdataall, cmtdata[[i]]$comments)
  }
  # Extract all comment texts from the n pages of comment data.
  # (The same kind of loop could be used to extract commenter profiles from the same pages.)
  cmttext <- character()
  for(i in 1:length(cmtdata)) {
    for(j in 1:length(cmtdata[[i]]$comments)) {
      cmttext <- c(cmttext, cmtdata[[i]]$comments[[j]]$text)
    }
  }
  return(cmttext)
}
### Execute the get-comments function over the hot-user posts (the loop below resumes from row 1009 after a manual restart).
cmttext_20151009 <- list()
for(i in 1009:nrow(hupinfo_20151009)) {
  print(i)
  cmttext_20151009[[i]] <- getcomments2(postid = hupinfo_20151009$Postid[i],
                                        ncomments = hupinfo_20151009$No.Comments[i])
  print(length(cmttext_20151009[[i]]))
  # If the post has been deleted, the number of comments is 0 and nothing is written to file.
  if(length(cmttext_20151009[[i]]) != 0) {
    write.table(cmttext_20151009[[i]],
                file = paste("C:/Users/jiangy/Dropbox/learnings/Stats_data/Projects/Weibo/Advwords/Raw/Comments_20151009/cmttext_20151009_",
                             i,
                             ".txt",
                             sep = ''),
                row.names = F,
                col.names = F,
                fileEncoding = "UTF-8")
  }
}
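In hindsight, the loop above could be made more resilient to deleted posts and the hourly call limit by wrapping each call in tryCatch() and pausing between posts. A sketch, using the same objects as above; this is not what was actually run, and the pacing value is a guess:
# A more defensive version of the comment-fetching loop (sketch).
for(i in 1:nrow(hupinfo_20151009)) {
  cmttext_20151009[[i]] <- tryCatch(
    getcomments2(postid = hupinfo_20151009$Postid[i],
                 ncomments = hupinfo_20151009$No.Comments[i]),
    error = function(e) {
      message("Post ", i, " failed (possibly deleted): ", conditionMessage(e))
      character(0)               # keep the loop going with an empty result
    })
  Sys.sleep(25)                  # crude pacing to stay under ~150 calls an hour
}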
From the comment texts, I extracted the following features:
1. Average number of words (Chinese characters) per comment.
2. Average number of words per comment for the top 20, longest comments.
3. Ratio of no. comments / no. reposts
4. Number of comments that contain keywords that indicate post is likely an ad.
5. Number of emoticon characters (defined: 4 characters per emoticon) as a fraction of total number of words in all comments.
6. Number of topic-mentions out of the total number of comments.
7. Number of user-mentions, excluding replies, out of total number of comments.
8. Number of user-mentions that are replies.
These comment-extracted features will be combined with the posts’ basic features in the next step. Note that the number of reposts also changes over time, but that API was not re-called, to save time; the No.reposts and Cmt.rep.ratio columns may therefore not be fully up to date.
##### TASK 4: Analyze comment texts to extract features #####
### Function that analyzes comment texts
# Takes in:
# 1. A character vector of all comments
# 2. Number of reposts
# Returns:
# 1. Average number of words (Chinese characters) per comment.
# 2. Average number of words per comment for the top 20 longest comments.
# 3. Ratio of no. comments / no. reposts
# 4. Number of comments that contain keywords indicating the post is likely an ad.
# 5. Number of emoticon characters (defined: 4 characters per emoticon) as a fraction of the total number of words in all comments.
# 6. Number of topic-mentions out of the total number of comments.
# 7. Number of user-mentions, excluding replies, out of the total number of comments.
# 8. Number of user-mentions that are replies, out of the total number of comments.
analcomment3 <- function(cmttext, nreposts) {
  # P1: Get the average number of words per comment.
  # P2: Get the average number of words per comment for the 20 longest comments.
  nwd <- as.numeric(sapply(cmttext, nchar))
  P1 <- ave(nwd)[1] # ***
  P2 <- ave(nwd[order(nwd, decreasing = T)][1:20])[1] # ***
  # P3: Get the comments / reposts ratio
  P3 <- length(cmttext) / nreposts # ***
  # Keywords in comments that indicate users have flagged the post as an ad.
  # The two named earlier in this post are "copycat" (chaoxi) and "advertisement" (guanggao);
  # the list used for the analysis contained several more ad- and marketing-related words.
  advwd <- c("抄袭", "广告")
  # Find out how many of the comments contain
  # P4: ANY of the above adv words,
  # P5: emoticons,
  # P6: topics with double hashtags ##, (specifically, no. cmts w ## / total no. cmts)
  # P7: mentions of other users that are not replies, (specifically, no. cmts w @ / total no. cmts)
  # P8: mentions of other users that are replies (Weibo prefixes these with "回复@").
  P4 <- length(grep(paste(advwd, collapse = "|"), cmttext)) # ***
  P5 <- length(grep("\\[", cmttext)) * 4 / sum(nwd) # emoticons render as [name] in raw text; counted as 4 characters each ***
  P6 <- length(grep("#", cmttext)) / length(nwd) # ***
  P7 <- length(grep("回复@", grep("@", cmttext, value = T), invert = T)) / length(nwd) # every @ except the replies ***
  P8 <- length(grep("回复@", cmttext)) / length(nwd) # the replies ***
  return(c(P1, P2, P3, P4, P5, P6, P7, P8))
}
### Execute the function
# Create a temporary dataframe to store the features.
b <- data.frame(Post.id = hupinfo_20151009$Postid,
                Ave.wds = numeric(nrow(hupinfo_20151009)),
                Ave.wds.20 = numeric(nrow(hupinfo_20151009)),
                Cmt.rep.ratio = numeric(nrow(hupinfo_20151009)),
                No.adv.wds = numeric(nrow(hupinfo_20151009)),
                No.emot = numeric(nrow(hupinfo_20151009)),
                No.topic = numeric(nrow(hupinfo_20151009)),
                No.mention = numeric(nrow(hupinfo_20151009)),
                No.replies = numeric(nrow(hupinfo_20151009)))
# Extract features from each hot-user post.
for(i in 1:nrow(hupinfo_20151009)) {
  print(i)
  if(length(cmttext_20151009[[i]]) != 0) {
    b[i, 2:ncol(b)] <- analcomment3(cmttext = cmttext_20151009[[i]],
                                    nreposts = hupinfo_20151009$No.reposts[i])
    # Note: No.reposts also changes over time, but that API is not re-called, to save time; this parameter may not be fully up to date.
  }
  print(b[i, ])
}
# Now all comment texts have been processed, features extracted, and the feature dataframe saved to file.
write.table(b,
            file = "C:/Users/jiangy/Dropbox/learnings/Stats_data/Projects/Weibo/Advwords/cmtanal_20151009.txt",
            sep = "\t",
            row.names = F)
The basic post features and the comment features are then combined, and a few more attributes are added to the feature matrix:
1. An updated number of comments, based on the actual number of comment texts extracted.
2. Post URLs, constructed from the posts’ mids in preparation for labeling and classification.
3. An outcome vector in preparation for classification.
Finally the expanded matrix is written to file, ready for analysis.
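The Weibo10tomid() helper used in the next code block is not shown in this post. Purely for illustration, here is a sketch of the “mid” conversion as it is commonly described in community write-ups (split the decimal ID into 7-digit groups from the right and base-62-encode each group into a 4-character block, leading block unpadded); the actual helper may differ:
# Hypothetical stand-in for Weibo10tomid(): pass the post ID as a string of decimal
# digits (to avoid numeric precision loss) and get back a base-62 string.
# Alphabet: 0-9, a-z, A-Z.
id_to_mid62 <- function(id_chr) {
  alphabet <- c(0:9, letters, LETTERS)
  encode62 <- function(n) {
    out <- character(0)
    repeat {
      out <- c(alphabet[n %% 62 + 1], out)
      n <- n %/% 62
      if(n == 0) break
    }
    paste(out, collapse = "")
  }
  # Split the decimal string into groups of 7 digits, right to left.
  groups <- character(0)
  while(nchar(id_chr) > 7) {
    groups <- c(substr(id_chr, nchar(id_chr) - 6, nchar(id_chr)), groups)
    id_chr <- substr(id_chr, 1, nchar(id_chr) - 7)
  }
  groups <- c(id_chr, groups)
  blocks <- vapply(as.numeric(groups), encode62, character(1))
  # Zero-pad every block except the leading one to 4 characters.
  if(length(blocks) > 1) {
    blocks[-1] <- vapply(blocks[-1],
                         function(b) paste0(strrep("0", 4 - nchar(b)), b),
                         character(1))
  }
  paste(blocks, collapse = "")
}
If this scheme matches, id_to_mid62(as.character(hupinfo_20151009$Postid[i])) could slot into the loop below in place of Weibo10tomid().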
##### TASK 5: Complete the feature matrix for the 20151009 hot posts #####
# Add a column with the updated comment counts.
No.Comments.Upd <- sapply(cmttext_20151009, length)
# Add a vector of post URLs. Weibo10tomid() is a helper (defined outside this post)
# that converts the numeric post ID into the short "mid" string used in post URLs.
Postid62 <- character(length(hupinfo_20151009$Postid))
Posturl <- character(length(hupinfo_20151009$Userid))
for(i in 1:length(hupinfo_20151009$Postid)) {
  if(nchar(hupinfo_20151009$Postid[i]) == 16) {
    Postid62[i] <- Weibo10tomid(as.character(hupinfo_20151009$Postid[i]))
  } else {
    Postid62[i] <- "NIL"
  }
}
for(i in 1:length(hupinfo_20151009$Userid)) {
  Posturl[i] <- paste("www.weibo.com/",
                      hupinfo_20151009$Userid[i],
                      "/",
                      Postid62[i],
                      sep = "")
}
# Add an Outcome vector for supervised learning.
Outcome <- numeric(nrow(b)) # 1 if the post is an ad; 0 if not (to be labeled manually).
# Combine basic post features, comment features, additional features and outcome into a single table.
hupinfo_cmt_20151009 <- cbind(Posturl,
                              Postid62,
                              hupinfo_20151009,
                              No.Comments.Upd,
                              b[2:ncol(b)],
                              Outcome) # The entire dataset is ready for sampling.
write.table(hupinfo_cmt_20151009,
            file = "C:/Users/jiangy/Dropbox/learnings/Stats_data/Projects/Weibo/Advwords/hupinfo_cmt_20151009.txt",
            sep = "\t",
            row.names = F)
Before clustering, the features are further cleaned:
1. Posts with zero comments (presumably deleted) are removed.
2. Posts with fewer than 20 comments (and therefore NA as the average length of their top 20 longest comments) have that feature set to 1.
3. Posts with zero reposts (and therefore an infinite comment-to-repost ratio) have the ratio reset from Inf to 0.
4. Non-numeric features are removed.
##### kmeans unsupervised learning #####
library(cluster)
library(fpc)
### Read in the data.
hupinfo_cmt_20151009 <- read.table("/Users/yingjiang/Dropbox/learnings/Stats_data/Projects/Weibo/Advwords/Dataframes/hupinfo_cmt_20151009_v3.txt",
                                   header = T,
                                   sep = "\t")
### Clean the data.
# Remove deleted posts. These posts existed when the data was initially pulled, but disappeared as the data was processed over the following days.
hupinfo_cmt_anal <- hupinfo_cmt_20151009[which(hupinfo_cmt_20151009$No.Comments.Upd != 0), ]
# Remove NAs and Infs.
hupinfo_cmt_anal$Ave.wds.20[is.na(hupinfo_cmt_anal$Ave.wds.20)] <- 1
hupinfo_cmt_anal$Cmt.rep.ratio[hupinfo_cmt_anal$Cmt.rep.ratio == Inf] <- 0
# Keep only the numeric features for clustering.
hup_unsup <- hupinfo_cmt_anal[, c(5, 7:17)]
colnames(hup_unsup)
colnames(hup_unsup)
## [1] "Post.length" "No.reposts" "No.Likes"
## [4] "No.Comments.Upd" "Ave.wds" "Ave.wds.20"
## [7] "Cmt.rep.ratio" "No.adv.wds" "No.emot"
## [10] "No.topic" "No.mention" "No.replies"
The features are then standardized: each column is centered to zero mean and scaled to unit variance.
# Note: later posts have undergone more owner- or system-initiated deletion and may not represent the general response of netizens well.
# Feature scaling (z-score standardization).
hup_unsup_sca <- hup_unsup
for(i in 1:ncol(hup_unsup)) {
  hup_unsup_sca[, i] <- (hup_unsup[, i] - mean(hup_unsup[, i])) / sd(hup_unsup[, i])
}
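The loop above is plain z-scoring; R’s built-in scale() gives the same result in one line:
# Equivalent one-liner: center each column to zero mean and unit variance.
hup_unsup_sca <- as.data.frame(scale(hup_unsup))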
We then run kmeans clustering on the data with 2 cluster centers.
# kmeans clustering
library(cluster)
library(fpc)
hup_kmean <- kmeans(hup_unsup_sca, 2)
plotcluster(hup_unsup_sca, hup_kmean$cluster,
xlab = "", ylab = "")
Even though there appear to be fairly clear boundaries between the two clusters (with slight overlap), developing an intuition for clusters in multidimensional data is notoriously hard; some practitioners reduce dimensionality before further analysis. Here we will take it on faith that, judging from commenting behavior, there exist two types of posts. A pairs plot with the cluster assignments colored in is a helpful starting point for further exploration in the context of the initial hypotheses.
with(hup_unsup_sca, pairs(hup_unsup_sca, col=c(1:3)[hup_kmean$cluster]))
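For the dimension-reduction route mentioned above, a minimal PCA projection of the same scaled features, colored by cluster, could complement the pairs plot (a sketch):
# Project the scaled features onto the first two principal components,
# colored by the kmeans assignment.
hup_pca <- prcomp(hup_unsup_sca)
plot(hup_pca$x[, 1], hup_pca$x[, 2],
     col = hup_kmean$cluster,
     xlab = "PC1", ylab = "PC2")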
No.adv.wds (indicative vocabulary in comments)
No.adv.wds is the feature name for the crowd-supplied indicative vocabulary. While the numbers of likes and reposts on a post (No.Likes and No.reposts, Figure 3a - b) generally increase slightly with No.adv.wds before falling off, the number of comments shows a different trend: No.Comments is more or less randomly distributed at low No.adv.wds, and forms two clusters at high No.adv.wds (Figure 3c), with a general increasing trend overall. The high-comment cluster could be due to a large injection of shuijun comments, while the low-comment cluster could reflect a lack of interest and engagement with these ad posts beyond the users who called them out. The high-comment group is larger than the low-comment group, suggesting that more of these “ad” posts are promoted than not.
Considering comments and reposts together, we see that the higher No.adv.wds, the lower the comment-to-repost ratio (Figure 3d). Weibo marketing effectiveness / payoff is commonly measured in repost counts. Comparing the patterns for No.reposts and Cmt.rep.ratio, the sharply decreasing trend in Cmt.rep.ratio suggests that the effort to repost and publicize these ad posts dwarfs real user engagement.
##### TASK 7: Discussion of hypotheses #####
# general behavior with ad.wds
par(mfrow = c(2,2))
plot(hup_unsup_sca$No.adv.wds, hup_unsup_sca$No.Likes,
xlab = "Number of indicative words",
ylab = "Number of likes")
plot(hup_unsup_sca$No.adv.wds, hup_unsup_sca$No.reposts,
xlab = "Number of indicative words",
ylab = "Number of reposts")
plot(hup_unsup_sca$No.adv.wds, hup_unsup_sca$No.Comments.Upd,
xlab = "Number of indicative words",
ylab = "Number of comments")
plot(hup_unsup_sca$No.adv.wds, hup_unsup_sca$Cmt.rep.ratio,
xlab = "Number of indicative words",
ylab = "Comment-to-repost ratio")
mtext("General behavior with No.Adv.wds", outer = TRUE, cex = 1.5)
A small fraction - 5% - of posts have No.reposts above 10K. These wildly shared posts exhibit isolated peaks roughly every 10K reposts (Figure 4a), with obvious gaps between 30K and 40K and between 60K and 70K. This may indicate repost “targets”: KPIs exacted by the clients of Weibo marketers to measure the effectiveness of the marketing effort. Plotted against more general parameters such as Post.length and Ave.wds.20 (the average comment length of the 20 wordiest comments, a measure of comment wordiness; Figure 4b - c), No.reposts also forms “layers”, roughly at 0, 30K and 60K reposts. Even more peculiarly, No.reposts spikes at around 1,200 and 2,000 comments (Figure 4d). Are these paid posts with specific comment targets, e.g. at 1K marks, on top of repost targets?
# General behavior with reposts
par(mfrow = c(2, 2))
hist(hup_unsup$No.reposts[hup_unsup$No.reposts>5000], 100,
xlab = "Number of reposts",
ylab = "Number of posts",
main = "")
plot(hup_unsup$Post.length, hup_unsup$No.reposts,
xlab = "Length of post",
ylab = "Number of reposts")
plot(hup_unsup$Ave.wds.20, hup_unsup$No.reposts,
xlab = "Ave. length of 20-most-wordy comments",
ylab = "Number of reposts")
plot(hup_unsup$No.Comments.Upd, hup_unsup$No.reposts,
xlab = "Number of comments",
ylab = "Number of reposts")
mtext("General behavior with reposts", outer = TRUE, cex = 1.5)
User-mentions, especially replies to previous comments, are taken to indicate elevated engagement within a Weibo post. Emoticon use in place of text, on the other hand, corresponds to “lazier” engagement compared with the typical behavior of emotional, word-loving Chinese netizens.
The emoticon-repost relationship (Figure 5a) shows a similar “3-layer” pattern, with No.reposts dropping drastically at an emoticon fraction of about 0.8 (i.e. 80% of the characters in the comment texts are emoticon characters). Similarly, lower emoticon use corresponds to more inter-mentions among commenters (Figure 5b), indicating a higher degree of engagement. As for user-mentions that are explicit replies between commenters, both high No.reposts and high No.adv.wds correspond to few replies (Figure 5c - d).
# General behavior user-mentions, emoticons
par(mfrow=c(2,2))
plot(hup_unsup$No.emot, hup_unsup$No.reposts,
xlab = "Fraction of emoticon characters",
ylab = "Number of reposts")
plot(hup_unsup$No.emot, hup_unsup$No.mention,
xlab = "Fraction of emoticon characters",
ylab = "Number of user-mentions that aren't replies")
plot(hup_unsup$No.adv.wds, hup_unsup$No.replies,
xlab = "Number of indicative words",
ylab = "Number of user-mentions that are direct replies")
plot(hup_unsup$No.reposts, hup_unsup$No.replies,
     xlab = "Number of reposts",
     ylab = "Number of user-mentions that are direct replies")
mtext("General behavior with user-mentions and emoticons", outer = TRUE, cex = 1.5)
The most interesting finding of this study is that No.adv.wds - the indicative vocabulary in comment texts that points to a post being sponsored content - indeed correlates with (1) a lower comment-to-repost ratio and (2) a lower number of inter-user replies. Both in turn indicate lower user engagement, possibly attributable to a lack of interest in inorganic content.
In addition, a layered structure was found in No.reposts, very likely indicating targets set by commercial partners and, in turn, non-organic content. These trends prepare the data for model building, ultimately towards predicting how much of Weibo’s content is sponsored.
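(Looking ahead to the classification phase: once the Outcome column has been labeled by hand, a first-pass model could be as simple as a logistic regression on a few of these features. A sketch, where hupinfo_cmt_labeled stands in for the manually labeled feature table:)
# Sketch of the planned next phase: a simple logistic regression on labeled data.
# hupinfo_cmt_labeled is hypothetical: the feature table with Outcome filled in by hand.
fit <- glm(Outcome ~ No.adv.wds + Cmt.rep.ratio + No.emot + No.replies,
           data = hupinfo_cmt_labeled,
           family = binomial)
summary(fit)
# Predicted probability that each post is an ad:
pred <- predict(fit, type = "response")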