Introduction and Motivation

Introduction. Stack Overflow is a question and answer site founded in 2008. With over 40 million unique visitors each month, it remains one of the premier resources for students, researchers, and industry professionals seeking answers to software engineering questions [1]. Like other Q&A sites [2, 3, 4], Stack Overflow uses a crowdsourcing framework for content production and site moderation. Stack Overflow strives to provide the software engineering community with the highest quality questions and answers to common and complex programming tasks. This is accomplished through a ladder-style reputation system in which users who ask good questions are awarded points, while users who ask questions classified as (1) off topic, (2) not constructive, (3) not a real question, or (4) too localized lose points and risk having their questions closed. The same applies to answers: multiple answers can be given for a single question, but only the best one is marked as the accepted answer. With 6% of all Stack Overflow questions marked as closed [5], a natural question to ask is why. What features of a Stack Overflow question make it good or bad? Answering these questions is the primary goal of this research project.

Motivation. Because Stack Overflow is moderated by its users, it can take time to close posts that are off topic, not constructive, not a real question, or too localized. These posts also lead to wasted effort, not only by the poster, but also by the moderators and other community members who take the time to read, understand, and answer them. To combat the problem of poor-quality questions, such questions could be tagged and closed before any human effort is expended. To do this, we can use the metadata of the poster and the question itself to build a classifier that tags these questions appropriately.

Research Question

Problem Statement: Build a classifier that predicts whether a question will remain open or be closed based on different features of a Stack Overflow question. The original problem statement and the datasets used can be found on Kaggle.

RQ1: What are the main features that indicate whether a new post will remain open or become closed?

RQ2: Why do these features play the roles they play?

Tools and Core Libraries

  1. RStudio (1.0.136)
    • Development and data visualization
  2. NetBeans (8.2)
    • Development and data preprocessing
  3. Stanford CoreNLP (3.7.0)
    • Core natural language software
  4. randomForest (4.6-12)
    • Classification
    • Supervised learning algorithm
  5. sentimentR (1.0.0)
    • Sentiment analysis
  6. ggplot2 (2.2.1)
    • Data visualization

Dataset

The dataset used to investigate our research questions was a 189-megabyte (MB) training dataset called train-sample_October_9_2012_v2.csv. It contains 178,352 observations spanning August 2008 to October 2012. The main difference between this dataset and the one presented in our previous project update is that the current dataset contains an equal number of open and closed posts.

As before, let’s see what we have to work with:

library(ggplot2)
library(ggthemes)
library(corrplot)
library(scales)
library(dplyr)
library(data.table)

# Read in data
train <- data.table(read.csv('train_October_9_2012.csv', stringsAsFactors = F))

#Summarize content in data table
str(train)

The str function gives us a concise overview of the structure and types of the features in our data.table.

Classes ‘data.table’ and 'data.frame':  178352 obs. of  36 variables:
 $ X                                  : int  1 2 3 4 5 6 7 8 9 10 ...
 $ PostId                             : int  10035653 8922537 5962216 10070625 ...
 $ PostCreationDate                   : chr  "2012-04-05 20:37:35" "2012-01-19 07:38:27" ...
 $ OwnerUserId                        : int  1159226 1157921 696219 490895 ...
 $ OwnerCreationDate                  : chr  "2012-01-19 18:46:16" "2012-01-19 07:31:34" ...
 $ ReputationAtPostCreation           : int  1 1 40 1 28 50 10 1 2422 38 ...
 $ OwnerUndeletedAnswerCountAtPostTime: int  0 0 2 1 0 13 0 0 91 0 ...
 $ Title                              : chr  "what is the best way to connect my" ...
 $ BodyMarkdown                       : chr  "I know this question can be answered by" ...
 $ Tag1                               : chr  "c++" "php" "iphone-sdk-4.0" "linux" ...
 $ Tag2                               : chr  NA "xml" NA "module" ...
 $ Tag3                               : chr  NA "cakephp" NA "kernel" ...
 $ Tag4                               : chr  NA "zip" NA NA ...
 $ Tag5                               : chr  NA NA NA NA ...
 $ PostClosedDate                     : chr  "2012-04-05 23:31:34" "2012-01-19" ...
 $ OpenStatus                         : chr  "too localized" "not a real question" "open" ...
 $ TimeBetweenJoiningAndPosting       : int  11 0 5 75 12 90 1 0 74 2 ...
 $ NumberOfWordsInTitle               : int  11 12 5 4 16 9 8 7 7 6 ...
 $ NumberOfQuestionsInTitle           : int  1 0 0 0 1 0 0 0 1 0 ...
 $ NumberOfWordsInBody                : int  197 78 86 67 26 69 68 45 72 151 ...
 $ NumberOfTags                       : int  1 4 1 3 2 3 2 2 3 2 ...
 $ NumberOfQuestionsInBody            : int  3 0 2 1 1 1 1 1 1 1 ...
 $ TagsContainHomework                : int  0 0 0 0 0 0 0 0 0 0 ...
 $ PostBodyContainsCodeFragment       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ TitleNoun                          : int  3 5 3 1 6 5 3 2 3 2 ...
 $ TitleAdverb                        : int  0 0 0 0 2 0 0 1 1 0 ...
 $ TitleVerb                          : int  2 2 1 2 4 0 1 1 3 1 ...
 $ TitleAdjective                     : int  1 2 0 0 0 1 0 1 0 1 ...
 $ BodyNoun                           : int  51 27 24 17 13 23 23 12 13 48 ...
 $ BodyAdverb                         : int  21 3 6 3 4 12 10 7 9 25 ...
 $ BodyVerb                           : int  47 16 18 20 5 16 13 9 24 31 ...
 $ BodyAdjective                      : int  14 1 6 4 2 4 4 3 2 6 ...
 $ HomeworkInTitle                    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ HomeworkInBody                     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ SentimentTitle                     : num  0.1508 0.1155 0 0 0.0625 ...
 $ SentimentBody                      : num  0.2242 0.0255 -0.0113 -0.1511 -0.01

From this output, we can see that we have 178,352 observations and 36 variables. The features in this dataset include the original features, as well as the engineered features we created as a result of the data exploration presented in project update two, using the Stanford CoreNLP and sentimentR libraries.
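
To give an idea of how the sentiment features can be produced, here is a minimal sketch using the sentimentR library (loaded as sentimentr in R) on two made-up example titles. This is only an illustration of the approach, not our exact preprocessing pipeline.

library(sentimentr)

# Two made-up example titles
titles <- c("What is the best way to connect my app to a database?",
            "My code is broken and nothing works, please fix it")

# sentiment_by() returns one average polarity score per input element
sentiment_by(titles)$ave_sentiment

# Applied to the real data, something along these lines would populate the feature:
# train$SentimentTitle <- sentiment_by(train$Title)$ave_sentiment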

Below you will find the name of each feature, its function, range of values, and whether or not it was used in the classifier. Note, additional features (i.e., Day, Month, Year) were added to this dataset as a result of further data exploration, which we will show in the data exploration section of this document.

Variable Name Description Range Classifier
PostId Unique post identifier [17, 12810064] No
PostCreationDate Post Date [8-1-08, 10-9-12] No
OwnerUserId Unique id of user [2, 1766790] No
OwnerCreationDate Account creation date [7-31-08, 10-9-12] No
ReputationAtPostCreation Reputation of user [-32, 155182] No
OwnerUndeletedAnswerCountAtPostTime Number of undeleted posts [0, 5031] No
Title Question title NA No
BodyMarkdown Question content NA No
Tag1 Domain of question NA No
Tag2 Same (optional) NA No
Tag3 Same (optional) NA No
Tag4 Same (optional) NA No
Tag5 Same (optional) NA No
PostClosedDate Post close date [8-22-08, 10-25-12] No
OpenStatus Post open status [0, 1] Yes
TimeBetweenJoiningAndPosting Post date - user join date [0, 218] Yes
NumberOfWordsInTitle # of words in title [1, 40] Yes
NumberOfQuestionsInTitle # of question marks in title [0, 6] Yes
NumberOfWordsInBody # of words in body [1, 12066] No
NumberOfQuestionsInBody # of question marks in body [0, 357] Yes
NumberOfTags # of tags used [1, 5] Yes
TagsContainHomework Is homework related? [0, 1] Yes
PostBodyContainsCodeFragment Body contains markup [0, 1] Yes
TitleNoun # of nouns in title [0, 21] Yes
TitleAdverb # of adverbs in title [0, 14] Yes
TitleVerb # of verbs in title [0, 10] Yes
TitleAdjective # of adjectives in title [0, 11] Yes
BodyNoun # of nouns in body [0, 4518] Yes
BodyAdverb # of adverbs in body [0, 2999] Yes
BodyVerb # of verbs in body [0, 843] Yes
BodyAdjective # of adjectives in body [0, 2669] Yes
HomeworkInTitle Is homework related? [0, 1] Yes
HomeworkInBody Is homework related? [0, 1] Yes
SentimentTitle Sentiment of title text [-2.26, 2.19] Yes
SentimentBody Sentiment of body text [-379.12, 2285.32] Yes
MembershipStatus Milestone status [1, 4] Yes
Day Day of week [0, 6] Yes
Month Month of year [0, 11] Yes
Year Year [2008, 2012] Yes
TagPopularity Tag popularity [1, 3] Yes

Data Exploration

Open Status

Currently, our OpenStatus variable is a factor with five levels (Open, Too localized, Not constructive, Not a real question, Off topic). Since we are dealing with a binary classification problem, let’s turn it into a variable with two levels (0 = closed, 1 = open). To be clear, a question marked as closed means that it fell into one of the four negative categories listed above; it does not mean that the question was answered and then marked as closed. Furthermore, a question marked as open may or may not have an answer, but it remains open because of its potential usefulness to the Stack Overflow community.

# Collapse the five-level OpenStatus feature into a binary variable (0 = closed, 1 = open)
train[, OpenStatus := ( ifelse(PostClosedDate == "", 1, 0))]
# View summary statistics for this variable
table(train$OpenStatus)
0     1 
89176 89176

Reputation at Post Creation

The first thing that caught our attention was the ReputationAtPostCreation variable. Let’s see what the distribution of reputation looks like:

ggplot(train, aes(x=ReputationAtPostCreation)) + 
  geom_density() +
  labs(x = 'User Reputation Points', y = 'Density', title = 'Reputation Density Plot') +
  theme(plot.title = element_text(hjust = 0.5))

It looks like a clear majority of users who ask questions have reputations below 200. Since this data is so skewed, it could be worth log transforming it to obtain something closer to a normal distribution, but let’s do something else: let’s create a new feature that encapsulates the user’s experience. We’ll do this by using the Stack Overflow reputation milestones as our guide.

train$MembershipStatus[train$ReputationAtPostCreation >= 20000] <- 'Trusted'
train$MembershipStatus[train$ReputationAtPostCreation < 20000 & train$ReputationAtPostCreation >= 1000] <- 'Established'
train$MembershipStatus[train$ReputationAtPostCreation < 1000 & train$ReputationAtPostCreation >= 200] <- 'Avid'
train$MembershipStatus[train$ReputationAtPostCreation < 200] <- 'New'

Okay, cool. We’ve created a new feature called MembershipStatus. Let’s see how it relates to our outcome variable (OpenStatus).

ggplot(train, aes(x = MembershipStatus, fill = factor(OpenStatus))) +
  geom_bar(stat='count', position='dodge') +
  labs(x = 'StackOverflow Status', y = 'Count', title = 'Number of Open and Closed Posts by User Status') +
  guides(fill=guide_legend(title='Open Status')) +
  scale_fill_hue(labels=c('Closed', 'Open')) +
  theme_few() +
  theme(plot.title = element_text(hjust = 0.5))

As expected, new users have more questions closed than avid users, avid users have more questions closed than established users, and established users have more questions closed than trusted users.
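
Raw counts are dominated by the sheer number of new users, so as a quick sanity check we can also look at the proportion of posts closed within each group:

# Proportion of posts closed within each membership status group
train[, .(posts = .N, closed_rate = round(mean(OpenStatus == 0), 3)), by = MembershipStatus]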

Post Creation Date

Currently, the PostCreationDate feature is a full year-month-day timestamp. This isn’t very useful on its own, so let’s split this variable into three separate features and see how they might relate to the outcome variable.

Day

dates <- gsub( " .*$", "", train$PostCreationDate)
# Convert date format to day of week
train$Day <- weekdays(as.Date(dates, format = "%Y-%m-%d"))
# Order days of week appropriately
train$Day <- factor(train$Day, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"), ordered = TRUE)
ggplot(train, aes(x = Day, fill = factor(OpenStatus))) +
  geom_bar(stat='count', position='dodge') +
  labs(x = 'Day of Week', y = 'Count', title = 'Number of Open and Closed Posts by Day') +
  guides(fill=guide_legend(title='Open Status')) +
  scale_fill_hue(labels=c('Closed', 'Open')) +
  theme_few() +
  theme(plot.title = element_text(hjust = 0.5))

It looks like there is a slight penalty for asking questions on the weekend. Let’s create a new feature called isWeekend to represent this. Note that, due to time constraints, we did not end up using this feature in our final analysis.

train[, isWeekend := ( ifelse(Day == 'Saturday' | Day == 'Sunday', 1, 0))]

Month

# Convert date format to months
train$Month <- months(as.Date(dates, format = "%Y-%m-%d"))
# Order months appropriately
train$Month <- factor(train$Month, levels = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"), ordered = TRUE)
ggplot(train, aes(x = Month, fill = factor(OpenStatus))) +
  geom_bar(stat='count', position='dodge') +
  labs(x = 'Month', y = 'Count', title = 'Number of Open and Closed Posts by Month') +
  guides(fill=guide_legend(title='Open Status')) +
  scale_fill_hue(labels=c('Closed', 'Open')) +
  theme_few() +
  theme(plot.title = element_text(hjust = 0.5))

An interesting trend appears between July and October: more questions get asked, and more questions get closed. Perhaps this is due to the beginning of the fall semester?

Year

# Convert date format to years
train$Year <- year(as.Date(dates, format = "%Y-%m-%d"))
ggplot(train, aes(x = Year, fill = factor(OpenStatus))) +
  geom_bar(stat='count', position='dodge') +
  labs(x = 'Year', y = 'Count', title = 'Number of Open and Closed Posts by Year') +
  guides(fill=guide_legend(title='Open Status')) +
  scale_fill_hue(labels=c('Closed', 'Open')) +
  theme_few() +
  theme(plot.title = element_text(hjust = 0.5))

As the popularity of Stack Overflow has increased, more people are asking questions and more questions are being closed.

Tags

There are several different tags within the Tag1, Tag2, Tag3, Tag4, and Tag5 columns. Let’s look at how many there are, as well as their distribution.

# Grab the five tag columns
tag_counts <- train[,10:14]
# Convert to vector of tags
aggregated_counts <- as.vector(as.matrix(tag_counts[,c('Tag1', 'Tag2', 'Tag3', 'Tag4', 'Tag5')]))
# Sort tags based on occurrence
aggregated_counts <- as.data.table(sort(table(aggregated_counts), decreasing = TRUE))
# Remove row corresponding to the empty tag
aggregated_counts <- aggregated_counts[-1,]

Let’s see which programming languages are the most popular. We’ll look at the top 25.

ggplot(aggregated_counts[1:25], aes(x = reorder(aggregated_counts, -N), y = N, fill = aggregated_counts)) + 
  geom_bar(stat='identity', position = 'dodge') +
  labs(x='Tag', y='Count', title = 'Top 25 Most Popular Tags', fill = 'Tag Name') +
  theme(axis.text.x=element_text(angle=90, hjust=1)) +
  theme(plot.title = element_text(hjust = 0.5))

It looks like php wins. Now, let’s look at summary statistics for the aggregated_counts data.table:

summary(aggregated_counts)
 aggregated_counts  N           
 Length:20050       Min.   :    1.00  
 Class :character   1st Qu.:    1.00  
 Mode  :character   Median :    3.00  
                    Mean   :   24.41  
                    3rd Qu.:    8.00  
                    Max.   :16829.00

We know from the previous project update that there are numerous singleton tags used in this dataset. Let’s remove them and only consider tags that have been used more than 25 times before proceeding.

aggregated_counts <- subset(aggregated_counts, N > 25)
aggregated_counts$N <- log(aggregated_counts$N, 10)
summary(aggregated_counts)
 aggregated_counts  N           
 Length:2087        Min.   :1.415  
 Class :character   1st Qu.:1.556  
 Mode  :character   Median :1.748  
                    Mean   :1.880  
                    3rd Qu.:2.093  
                    Max.   :4.226

This looks a little bit better. Let’s see what the distribution looks like:

ggplot(aggregated_counts, aes(x=N)) + 
  geom_density() +
  labs(x='Tag Occurrence', y='Density', title = 'Tag Density Plot') +
  theme(plot.title = element_text(hjust = 0.5))

That looks great. Now let’s try to group tags based on popularity to see how they may relate to the outcome variable. We’ll do this by classifying tags with a count below the mean as Unpopular, tags with a count between the mean and third quartile as Moderate, and tags with a count between the third quartile and max as Popular.

minimum <- min(aggregated_counts$N)
average <- mean(aggregated_counts$N)
third_quartile <- quantile(aggregated_counts$N)["75%"]
maximum <- max(aggregated_counts$N)

breaks <- c(minimum, average, third_quartile, maximum)
labels <- c("Unpopular", "Moderate", "Popular")
bins <- cut(aggregated_counts$N, breaks, include.lowest = T, right=FALSE, labels=labels)

aggregated_counts$TagPopularity <- bins
# Rename the tag column so we can join on Tag1
names(aggregated_counts)[names(aggregated_counts) == "aggregated_counts"] <- "Tag1"
# Join the popularity buckets onto the training data by Tag1
setDT(train)[aggregated_counts, TagPopularity := i.TagPopularity, on = .(Tag1)]
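
Note that posts whose Tag1 fell below the 25-use cutoff do not receive a popularity bucket from this join, so it is worth checking how the new feature is populated:

# Posts whose Tag1 was a rare (filtered-out) tag are left as NA
table(train$TagPopularity, useNA = "ifany")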

Owner Creation Date

We can get some interesting information from the OwnerCreationDate feature. Let’s use the year the user created an account to get his or her account age on Stack Overflow.

birthdayDates <- gsub( " .*$", "", train$OwnerCreationDate)
birth <- year(as.Date(birthdayDates, format = "%Y-%m-%d"))
age <- 2012 - birth
train$Age <- age
ggplot(train, aes(x = Age, fill = factor(OpenStatus))) +
     geom_bar(stat='count', position='dodge') +
     labs(x = 'Account Age (in years)', y = 'Count', title = 'Number of Open and Closed Posts by Account Age') +
     guides(fill=guide_legend(title='Open Status')) +
     scale_fill_hue(labels=c('Closed', 'Open')) +
     theme_few() +
     theme(plot.title = element_text(hjust = 0.5))

Parts of Speech and Sentiment

As mentioned previously, we also engineered features related to the parts of speech and sentiment of a post. In this context, parts of speech refer to the number of nouns, verbs, adverbs, and adjectives in the title and body of a post. Sentiment refers to the tone or attitude expressed in the title or body of the post, where a score below zero is considered negative, a score near zero neutral, and a score above zero positive. Let’s get an idea of how these features relate to the outcome variable:

library(corrplot)
# Correlations are only defined for numeric columns, so restrict to those
num_cols <- names(train)[sapply(train, is.numeric)]
cor_data <- cor(train[, num_cols, with = FALSE])
corrplot(cor_data, method = "square")

Cool, it looks like the parts of speech features we engineered have a slight to moderate correlation with our outcome variable. Unfortunately, the sentiment features we engineered appear to have a negative correlation, so we will not pursue them further in our data exploration. Let’s take a closer look at the parts of speech features.

BodyLength and TitleLength

b_words <- ggplot(train, aes(x = MembershipStatus, y = NumberOfWordsInBody, fill = factor(OpenStatus))) +
  stat_summary(fun.y="mean", geom="bar", position='dodge') + 
  labs(x = 'StackOverflow Status', y = 'Mean Number of Words in Body', title = 'Avg(Words) in Body Based on User Status') +
  guides(fill=guide_legend(title='Open Status')) +
  scale_fill_hue(labels=c('Closed', 'Open')) +
  theme_few() +
  theme(plot.title = element_text(hjust = 0.5))
  
t_words <- ggplot(train, aes(x = MembershipStatus, y = NumberOfWordsInTitle, fill = factor(OpenStatus))) +
  stat_summary(fun.y="mean", geom="bar", position='dodge') + 
  labs(x = 'StackOverflow Status', y = 'Mean Number of Words in Title', title = 'Avg(Words) in Title Based on User Status') +
  guides(fill=guide_legend(title='Open Status')) +
  scale_fill_hue(labels=c('Closed', 'Open')) +
  theme_few() +
  theme(plot.title = element_text(hjust = 0.5))
  
library(gridExtra)
grid.arrange(b_words, t_words, ncol = 2)

BodyNouns and TitleNouns

b_nouns <- ggplot(train, aes(x = MembershipStatus, y = BodyNoun, fill = factor(OpenStatus))) +
  stat_summary(fun.y="mean", geom="bar", position='dodge') + 
  labs(x = 'StackOverflow Status', y = 'Mean Number of Nouns in Body', title = 'Avg(Nouns) in Body Based on User Status') +
  guides(fill=guide_legend(title='Open Status')) +
  scale_fill_hue(labels=c('Closed', 'Open')) +
  theme_few() +
  theme(plot.title = element_text(hjust = 0.5))
  
t_nouns <- ggplot(train, aes(x = MembershipStatus, y = TitleNoun, fill = factor(OpenStatus))) +
  stat_summary(fun.y="mean", geom="bar", position='dodge') + 
  labs(x = 'StackOverflow Status', y = 'Mean Number of Nouns in Title', title = 'Avg(Nouns) in Title Based on User Status') +
  guides(fill=guide_legend(title='Open Status')) +
  scale_fill_hue(labels=c('Closed', 'Open')) +
  theme_few() +
  theme(plot.title = element_text(hjust = 0.5))
  
grid.arrange(b_nouns, t_nouns, ncol = 2)

Removing Collinear Features

Before building our model, we should remove features that are collinear, as they can affect the performance of our model. Let’s identify collinear features using a 75% correlation threshold and remove them:

library(caret)
result <- findCorrelation(cor_data, cutoff = 0.75)
cor_data[, result]
library(knitr)
library(data.table)
corr_tbl <- data.table(read.csv('corr.csv', sep = '\t', stringsAsFactors = T))
kable(corr_tbl)
X NumberOfWordsInBody BodyNoun BodyAdjective ReputationAtPostCreation
ReputationAtPostCreation 0.0145523 -0.0045167 0.0032893 1.0000000
TimeBetweenJoiningAndPosting 0.0186692 -0.0049861 -0.0040282 0.2998619
NumberOfWordsInTitle 0.0422945 0.0191837 0.0191390 0.0270069
NumberOfQuestionsInTitle -0.0587115 -0.0728418 -0.0498797 0.0739489
NumberOfQuestionsInBody 0.2312444 0.2017594 0.2241463 0.0220877
TagsContainHomework 0.0356223 0.0292592 0.0273377 -0.0084125
PostBodyContainsCodeFragment 0.0441115 0.0499384 0.0543177 -0.0039682
TitleNoun 0.0649369 0.0543866 0.0281176 -0.0056978
TitleAdverb 0.0274721 0.0242831 0.0257428 0.0162788
TitleVerb 0.0324711 0.0199504 0.0193019 0.0273074
TitleAdjective 0.0309499 0.0036676 0.0365767 0.0302319
BodyNoun 0.8950727 1.0000000 0.7921082 -0.0045167
BodyAdverb 0.7330888 0.8121151 0.7610987 -0.0009667
BodyVerb 0.8079848 0.6226818 0.5231279 0.0343449
BodyAdjective 0.7487115 0.7921082 1.0000000 0.0032893
HomeworkInTitle 0.0120184 0.0043175 0.0056380 -0.0017060
HomeworkInBody 0.0657162 0.0429439 0.0387880 -0.0077251
SentimentTitle -0.0623379 -0.0878986 -0.0626156 0.0108036
SentimentBody 0.0045638 0.0077567 0.0026799 -0.0007069
Year 0.0442817 0.0597827 0.0441286 -0.0358436
NumberOfTags 0.1322587 0.0849063 0.0651901 0.0285871
DayOfWeek 0.0038018 0.0012470 0.0070569 -0.0034225
MonthNumber -0.0033319 -0.0028840 0.0007746 -0.0037980
MembershipStatus 0.0001896 -0.0210451 -0.0112578 0.2736178
PopularityNumber -0.0252074 -0.0291668 -0.0291960 0.0031493
OwnerUndeletedAnswerCountAtPostTime 0.0188627 -0.0029363 0.0052784 0.9117019
NumberOfWordsInBody 1.0000000 0.8950727 0.7487115 0.0145523
OpenStatus 0.0988113 0.0800382 0.0609730 0.0615442

We can see that OwnerUndeletedAnswerCountAtPostTime is highly correlated with ReputationAtPostCreation, and that NumberOfWordsInBody is highly correlated with both BodyNoun and BodyVerb. Let’s remove the OwnerUndeletedAnswerCountAtPostTime and NumberOfWordsInBody features.

train$OwnerUndeletedAnswerCountAtPostTime <- NULL
train$NumberOfWordsInBody <- NULL

Results

Model

Now that we have engineered several features, we can create a model to classify posts based on them. First, we want to partition off some of our data as a final test set. We used a 70/30 split: the model is trained on 70% of the data, and we then see how well it performs on the remaining 30%.

# Keep a copy of the full dataset and draw a random 70% sample of row indices
traindata <- train
t_idx <- sample(seq_len(nrow(traindata)), size = floor(0.7 * nrow(traindata)))
# 70% to train
train <- traindata[t_idx, ]
# 30% to test
test <- traindata[-t_idx, ]
nrow(train)
nrow(test)
# Training size
124846
# Test size
53506

Now, let’s build our model using the randomForest package in R. First, let’s do a vanilla run to benchmark our model. We build the model with an ntree value of 500, which specifies the number of trees to grow, and set importance to TRUE so that we get the resulting feature importance statistics.

rf <- randomForest(formula = as.factor(OpenStatus) ~ ., data = train, importance = T, ntree = 500)
rf
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 5
OOB estimate of  error rate: 29.48%

Confusion matrix:
      0     1   class.error
0 43873 18709   0.2989518
1 18095 44169   0.2906174

Not bad. We received an out-of-bag (OOB) error of 29.48%, which, put very simply, is an estimate of the prediction error of our model. We can also see from the confusion matrix how well we classify observations at each of the two levels of our outcome variable. It looks like we are roughly equally good at predicting open and closed posts. Let’s look at this graphically:

plot(rf, ylim=c(0, 0.40))

From this graph, we see exactly what was shown in the confusion matrix. The black line is our OOB error (29.48%), the red line is our error rate for closed questions (29.89%), and the green line is our error rate for open questions (29.06%).

We can also see how the number of trees used to construct the classifier impacts the resulting error rate. As the number of trees approaches 500, the error curves begin to flatten out, which tells us that (1) 500 was a reasonable value for the number of trees and (2) increasing the number of trees past 500 is unlikely to improve the error rate.
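
The same information can also be read directly from the fitted object: err.rate stores the cumulative OOB and per-class error rates after each tree is added.

# Cumulative error rates: columns are OOB and the per-class errors for 0 (closed) and 1 (open)
tail(rf$err.rate, 3)
# Smallest OOB error reached anywhere along the run
min(rf$err.rate[, "OOB"])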

Let’s now look at the importance of each feature:

varImpPlot(rf)

From these graphs, we can see that the top features are:

  • BodyNoun
  • SentimentBody
  • BodyAdverb
  • BodyVerb
  • ReputationAtPostCreation
  • Year

We also made several efforts to improve the accuracy of our model. Our first method involved tweaking the input parameters to the randomForest algorithm. The parameters used included:

  • mtry (Number of features tried at each node split)
  • nodesize (Minimum node size)
  • ntree (Number of trees to grow)

We found that an mtry value of 5, a nodesize of 12, and an ntree value of 500 provided marginally better results than our benchmark.
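
We did not run an exhaustive search, but a small grid over candidate values, sketched below, is one way to compare settings using the OOB error (ntree is kept low here just to make the search affordable):

# Sketch of a small parameter grid search; not necessarily the exact procedure we used
params <- expand.grid(mtry = c(3, 5, 7), nodesize = c(5, 12, 25))
oob <- apply(params, 1, function(p) {
  fit <- randomForest(as.factor(OpenStatus) ~ ., data = train,
                      ntree = 200, mtry = p["mtry"], nodesize = p["nodesize"])
  # OOB error after the final tree
  fit$err.rate[nrow(fit$err.rate), "OOB"]
})
params[which.min(oob), ]

Refitting with the chosen values: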

rf_best <- randomForest(formula = as.factor(OpenStatus) ~ ., data = train, importance = T, ntree = 500, mtry = 5, nodesize = 12)
rf_best
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 5
OOB estimate of  error rate: 29.35%

Confusion matrix:
      0     1   class.error
0 43900 18682   0.2985203
1 17966 44298   0.2885455

From this output, we can see a slight improvement in our OOB error (29.35%), little to no improvement in our error rate for closed questions (29.85%), and a slight improvement in our error rate for open questions (28.85%). Let’s see if the variable importance statistics changed at all:

varImpPlot(rf_best)

It does not look like much has changed: the Year feature climbed a few ranks in the MeanDecreaseGini plot, but otherwise the ordering is largely the same after tuning the parameters.

Our second attempt at improving our model involved using a smaller number of features. According to Han, Guo, and Yu [6], less important features can hurt model performance when using random forest methods. Thus, they suggest the following algorithm:

In Step 1:

  • Run the random forest algorithm
  • Rank every feature using the MeanDecreaseAccuracy (MDA) and MeanDecreaseGini (MDG)

In Step 2:

  • Subset the top 50 percent of highest scoring features
  • Run random forest using this subset of features and inspect the error rate

If the error rate decreases after an iteration, we repeat the procedure, and we keep doing so until the error rate increases or there are no more features left [6]. A rough sketch of how one such iteration could be automated is shown below, followed by our actual run of the algorithm using the same parameters as our previous model.
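
The sketch below is only illustrative; it assumes the tuned rf_best model from above (fit with importance = T), ranks the features by combining their MeanDecreaseAccuracy and MeanDecreaseGini ranks, and refits on the top half.

# Illustrative sketch of one iteration of the rank-and-subset procedure
imp <- importance(rf_best)
# Combine the two importance rankings (a lower combined rank means a more important feature)
ranks <- rank(-imp[, "MeanDecreaseAccuracy"]) + rank(-imp[, "MeanDecreaseGini"])
top_half <- names(sort(ranks))[1:ceiling(length(ranks) / 2)]
fmla <- as.formula(paste("as.factor(OpenStatus) ~", paste(top_half, collapse = " + ")))
rf_subset <- randomForest(fmla, data = train, importance = T,
                          ntree = 500, mtry = 5, nodesize = 12)
# OOB error after all 500 trees
rf_subset$err.rate[nrow(rf_subset$err.rate), "OOB"]

Our actual run on the top half of the features follows: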

rf_han_guo_yu = randomForest(as.factor(OpenStatus) ~ BodyNoun + BodyAdverb + SentimentBody
                  + BodyVerb + ReputationAtPostCreation + Year 
                  + TimeBetweenJoiningAndPosting + SentimentTitle + BodyAdjective
                  + MonthNumber + NumberOfWordsInTitle + TitleNoun + DayOfWeek,
                  data = train,
                  importance = T,
                  ntree = 500,
                  mtry = 5,
                  nodesize = 12)
rf_han_guo_yu
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 5
OOB estimate of  error rate: 30.5%

Confusion matrix:
      0     1   class.error
0 43236 19346   0.3091304
1 18738 43526   0.3009444

Oh, no! Our error rate increased, going from 29.35% when considering all features to 30.50% when considering only half of them. We’ll stop here, as removing any more features will likely continue to increase the error rate.

Prediction

Earlier, we split our data into 70% training and 30% test partitions and used the 70% to train our model. Now let’s make predictions on the remaining 30% of held-out test data.

# Drop the outcome column (OpenStatus, originally column 26) before predicting
prediction <- predict(rf_best, newdata = test[, !"OpenStatus"])
prediction_table = table(prediction, test$OpenStatus);
prediction_table
prediction_table      0     1
                0 18552  7804
                1  8042 19108

One question we may want to ask about this confusion matrix is how precise our classifier is. Precision is the proportion of posts we predicted as open (our positive class) that are actually open. We can calculate it using a simple formula: \[P = \frac{TP}{TP + FP}\]

precision = prediction_table[2,2]/(sum(prediction_table[2,1:2]))
precision
0.7037937

How about recall? Recall is the proportion of truly open posts that we correctly identified as open. We can calculate it using the following formula: \[R = \frac{TP}{TP + FN}\]

recall = prediction_table[2,2]/(sum(prediction_table[1:2,2]))
recall
0.7100178

Let’s also look at the F-score, which combines precision and recall and can be calculated as follows: \[F = 2\frac{P \cdot R}{P + R}\]

f1Score = (2*precision*recall)/(precision+recall)
f1Score
0.7068921

We can also look at the overall model accuracy by considering the number of correct predictions over all predictions: \[A = \frac{TP+TN}{TP+TN+FP+FN}\]

accuracy = (prediction_table[2,2] + prediction_table[1,1])/(sum(prediction_table[1:2,1:2]))
accuracy
0.7038463
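
For convenience, the four calculations above can be wrapped into a small helper function. This assumes the layout of prediction_table shown above: predictions in rows, actual labels in columns, with open (1) treated as the positive class.

# Precision, recall, F1, and accuracy from a 2x2 prediction table
classification_metrics <- function(tbl) {
  tn <- tbl[1, 1]; fn <- tbl[1, 2]
  fp <- tbl[2, 1]; tp <- tbl[2, 2]
  p <- tp / (tp + fp)
  r <- tp / (tp + fn)
  c(precision = p,
    recall = r,
    f1 = 2 * p * r / (p + r),
    accuracy = (tp + tn) / (tp + tn + fp + fn))
}
classification_metrics(prediction_table)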

Discussion

RQ1: What are the main features that indicate whether a new post will remain open or become closed? As seen in the model results, the major factors are the composition and sentiment of the post body. These were the most important features in determining whether a post will remain open or be closed. Unfortunately, some of the variables we had hoped would be strong features, such as the presence of homework terms in the title or body, membership status, and whether the post contained code fragments, turned out not to be very important.

RQ2: Why do these features play the roles they play? When considering post composition, the number of nouns, verbs, and so on make sense as key factors: a post that is too long or too short can hamper its ability to convey a meaningful question. The sentiment of the post can also play a key role. If a post carries a negative sentiment, it may convey an unintended message and be marked for closure. However, even a positively worded post can be closed if it doesn’t contain enough content.

As for the features that didn’t play major roles, we believe many of them were weak simply because they occur so rarely in the data. Our dataset contained only approximately one thousand posts with code fragments, which is less than one percent of the data overall. Furthermore, some students do not seem to use common keywords when posting homework questions: our dataset had a little fewer than 3,000 posts with homework-related terms in the body, and fewer than 300 in the title.
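
These counts can be checked directly with a quick tally over the training partition (the exact numbers depend on which split is loaded):

# How often the binary indicator features actually fire
colSums(train[, .(PostBodyContainsCodeFragment, TagsContainHomework,
                  HomeworkInTitle, HomeworkInBody)])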

Future Work

Refine our features to detect code fragments that are not wrapped in code tags. From this we could determine not only the impact of the different ways of posting code, but also whether having code in a post plays a key role at all.

Determine how effective the parts-of-speech breakdown is compared with simply using the number of words in the body and title. As we saw, most of the key features were related to the composition of the post body. Collapsing these back into a single word count may reveal how useful the breakdown really was in determining post status.

Finally, our analysis of both sentiment and parts of speech included code fragments. Excluding code from these features could further refine how much of an impact they have.

Conclusion

In this study, we examined several features of Stack Overflow questions and created a model that can predict the open status of a question with 70% accuracy. We also identified several key features of a Stack Overflow post that play a role in its open status. Our results are comparable to previous work [7] in this domain, but there is still room for improvement. We utilized the random forest algorithm to build our model and make predictions; however, it could be beneficial to examine other models, such as support vector machines (SVM) or gradient boosted trees (GBT), to see whether they offer any additional improvement in classification.

References

[1] Stack Overflow. (n.d.). Retrieved May 2, 2017, from https://stackoverflow.com/

[2] Reddit. (n.d.). Retrieved May 2, 2017, from https://www.reddit.com

[3] Biostars. (n.d.). Retrieved May 2, 2017, from https://www.biostars.org

[4] Quora. (n.d.). Retrieved May 2, 2017, from https://www.quora.com

[5] Kaggle. (n.d.). Retrieved May 2, 2017, from https://www.kaggle.com/c/predict-closed-questions-on-stack-overflow

[6] Han, H., Guo, X., & Yu, H. (2016, August). Variable selection using Mean Decrease Accuracy and Mean Decrease Gini based on Random Forest. In Software Engineering and Service Science (ICSESS), 2016 7th IEEE International Conference on (pp. 219-224). IEEE.

[7] Correa, D., & Sureka, A. (2013, October). Fit or unfit: analysis and prediction of 'closed questions' on Stack Overflow. In Proceedings of the First ACM Conference on Online Social Networks (pp. 201-212). ACM.

[8] Correa, D., & Sureka, A. (2014, April). Chaff from the wheat: characterization and modeling of deleted questions on stack overflow. In Proceedings of the 23rd international conference on World wide web (pp. 631-642). ACM.

[9] Xia, X., Lo, D., Correa, D., Sureka, A., & Shihab, E. (2016, June). It takes two to tango: Deleted stack overflow question prediction with text and meta features. In Computer Software and Applications Conference (COMPSAC), 2016 IEEE 40th Annual (Vol. 1, pp. 73-82). IEEE.

[10] Ponzanelli, L., Mocci, A., Bacchelli, A., Lanza, M., & Fullerton, D. (2014, September). Improving low quality stack overflow post detection. In Software Maintenance and Evolution (ICSME), 2014 IEEE International Conference on (pp. 541-544). IEEE.

[11] Ponzanelli, L., Mocci, A., Bacchelli, A., & Lanza, M. (2014, October). Understanding and classifying the quality of technical forum questions. In Quality Software (QSIC), 2014 14th International Conference on (pp. 343-352). IEEE.