Introduction and Motivation

Introduction. Stack Overflow is a question and answer site founded in 2008. With over 40 million unique visitors each month, it remains one of the premier resources for students, researchers, and industry professionals seeking answers to software engineering questions [1]. Like other Q&A sites [2, 3, 4], Stack Overflow uses a crowdsourcing framework for content production and site moderation. Stack Overflow strives to provide the software engineering community with the highest quality questions and answers to common and complex programming tasks. This is accomplished through a ladder-style reputation system in which users who ask good questions are awarded points, while users who ask questions classified as (1) off topic, (2) not constructive, (3) not a real question, or (4) too localized lose points and risk having their questions closed. The same applies to answers: multiple answers can be given for a single question, but only the best one is marked as the accepted answer. With 6% of all Stack Overflow questions marked as closed [5], a natural question to ask is why. What features of a Stack Overflow question make it good or bad? Answering these questions is the primary goal of this research project.

Motivation. Because Stack Overflow is moderated by its users, it can take time to close posts that are off topic, not constructive, not a real question, or too localized. These posts also lead to wasted effort, not only by the poster, but also by the moderators and other community members who take the time to read, understand, and answer them. To combat the problem of poor-quality questions, such questions could be tagged and closed before any human effort is expended. To do this, we can use the metadata of the poster and the question itself to build a classifier that tags these questions appropriately.

Research Question

Problem Statement: Build a classifier that predicts whether a question will remain open or be closed based on different features of a Stack Overflow question. The original problem statement and the datasets used can be found on Kaggle.

RQ1: What are the main features that indicate whether a new post will remain open or become closed?

RQ2: Why do these features play the roles they play?

Tools and Core Libraries

  1. RStudio (1.0.136)
    • Development and data visualization
  2. NetBeans (8.2)
    • Development and data preprocessing
  3. Stanford CoreNLP (3.7.0)
    • Core natural language software
  4. randomForest (4.6-12)
    • Classification
    • Supervised learning algorithm
  5. sentimentR (1.0.0)
    • Sentiment analysis
  6. ggplot2 (2.2.1)
    • Data visualization

Dataset

The dataset used to investigate our research questions was a 189-megabyte (MB) training dataset called train-sample_October_9_2012_v2.csv. It contains 178,352 observations spanning August 2008 to October 2012. The main difference between this dataset and the one presented in our previous project update is that the current dataset contains an equal number of open and closed posts.

As before, let’s see what we have to work with:

library(ggplot2)
library(ggthemes)
library(corrplot)
library(scales)
library(dplyr)
library(data.table)

# Read in data
train <- data.table(read.csv('train_October_9_2012.csv', stringsAsFactors = F))

#Summarize content in data table
str(train)

The str function gives us a concise overview of the structure and types of the features in our data.table.

Classes ‘data.table’ and 'data.frame':  178352 obs. of  36 variables:
 $ X                                  : int  1 2 3 4 5 6 7 8 9 10 ...
 $ PostId                             : int  10035653 8922537 5962216 10070625 ...
 $ PostCreationDate                   : chr  "2012-04-05 20:37:35" "2012-01-19 07:38:27" ...
 $ OwnerUserId                        : int  1159226 1157921 696219 490895 ...
 $ OwnerCreationDate                  : chr  "2012-01-19 18:46:16" "2012-01-19 07:31:34" ...
 $ ReputationAtPostCreation           : int  1 1 40 1 28 50 10 1 2422 38 ...
 $ OwnerUndeletedAnswerCountAtPostTime: int  0 0 2 1 0 13 0 0 91 0 ...
 $ Title                              : chr  "what is the best way to connect my" ...
 $ BodyMarkdown                       : chr  "I know this question can be answered by" ...
 $ Tag1                               : chr  "c++" "php" "iphone-sdk-4.0" "linux" ...
 $ Tag2                               : chr  NA "xml" NA "module" ...
 $ Tag3                               : chr  NA "cakephp" NA "kernel" ...
 $ Tag4                               : chr  NA "zip" NA NA ...
 $ Tag5                               : chr  NA NA NA NA ...
 $ PostClosedDate                     : chr  "2012-04-05 23:31:34" "2012-01-19" ...
 $ OpenStatus                         : chr  "too localized" "not a real question" "open" ...
 $ TimeBetweenJoiningAndPosting       : int  11 0 5 75 12 90 1 0 74 2 ...
 $ NumberOfWordsInTitle               : int  11 12 5 4 16 9 8 7 7 6 ...
 $ NumberOfQuestionsInTitle           : int  1 0 0 0 1 0 0 0 1 0 ...
 $ NumberOfWordsInBody                : int  197 78 86 67 26 69 68 45 72 151 ...
 $ NumberOfTags                       : int  1 4 1 3 2 3 2 2 3 2 ...
 $ NumberOfQuestionsInBody            : int  3 0 2 1 1 1 1 1 1 1 ...
 $ TagsContainHomework                : int  0 0 0 0 0 0 0 0 0 0 ...
 $ PostBodyContainsCodeFragment       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ TitleNoun                          : int  3 5 3 1 6 5 3 2 3 2 ...
 $ TitleAdverb                        : int  0 0 0 0 2 0 0 1 1 0 ...
 $ TitleVerb                          : int  2 2 1 2 4 0 1 1 3 1 ...
 $ TitleAdjective                     : int  1 2 0 0 0 1 0 1 0 1 ...
 $ BodyNoun                           : int  51 27 24 17 13 23 23 12 13 48 ...
 $ BodyAdverb                         : int  21 3 6 3 4 12 10 7 9 25 ...
 $ BodyVerb                           : int  47 16 18 20 5 16 13 9 24 31 ...
 $ BodyAdjective                      : int  14 1 6 4 2 4 4 3 2 6 ...
 $ HomeworkInTitle                    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ HomeworkInBody                     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ SentimentTitle                     : num  0.1508 0.1155 0 0 0.0625 ...
 $ SentimentBody                      : num  0.2242 0.0255 -0.0113 -0.1511 -0.01

From this output, we can see that we have 178,352 observations and 36 variables. The features in this dataset include the original features, as well as the engineered features we created as a result of the data exploration presented in project update two, using the Stanford CoreNLP and sentimentR libraries.
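
To give an idea of how the sentiment features can be produced, here is a minimal sketch using the sentimentR library (loaded as sentimentr in R) on two made-up example titles. This is only an illustration of the approach, not our exact preprocessing pipeline.

library(sentimentr)

# Two made-up example titles
titles <- c("What is the best way to connect my app to a database?",
            "My code is broken and nothing works, please fix it")

# sentiment_by() returns one average polarity score per input element
sentiment_by(titles)$ave_sentiment

# Applied to the real data, something along these lines would populate the feature:
# train$SentimentTitle <- sentiment_by(train$Title)$ave_sentiment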

Below you will find the name of each feature, its function, range of values, and whether or not it was used in the classifier. Note, additional features (i.e., Day, Month, Year) were added to this dataset as a result of further data exploration, which we will show in the data exploration section of this document.

Variable Name Description Range Classifier
PostId Unique post identifier [17, 12810064] No
PostCreationDate Post Date [8-1-08, 10-9-12] No
OwnerUserId Unique id of user [2, 1766790] No
OwnerCreationDate Account creation date [7-31-08, 10-9-12] No
ReputationAtPostCreation Reputation of user [-32, 155182] No
OwnerUndeletedAnswerCountAtPostTime Number of undeleted posts [0, 5031] No
Title Question title NA No
BodyMarkdown Question content NA No
Tag1 Domain of question NA No
Tag2 Same (optional) NA No
Tag3 Same (optional) NA No
Tag4 Same (optional) NA No
Tag5 Same (optional) NA No
PostClosedDate Post close date [8-22-08, 10-25-12] No
OpenStatus Post open status [0, 1] Yes
TimeBetweenJoiningAndPosting Post date - user join date [0, 218] Yes
NumberOfWordsInTitle # of words in title [1, 40] Yes
NumberOfQuestionsInTitle # of question marks in title [0, 6] Yes
NumberOfWordsInBody # of words in body [1, 12066] No
NumberOfQuestionsInBody # of question marks in body [0, 357] Yes
NumberOfTags # of tags used [1, 5] Yes
TagsContainHomework Is homework related? [0, 1] Yes
PostBodyContainsCodeFragment Body contains markup [0, 1] Yes
TitleNoun # of nouns in title [0, 21] Yes
TitleAdverb # of adverbs in title [0, 14] Yes
TitleVerb # of verbs in title [0, 10] Yes
TitleAdjective # of adjectives in title [0, 11] Yes
BodyNoun # of nouns in body [0, 4518] Yes
BodyAdverb # of adverbs in body [0, 2999] Yes
BodyVerb # of verbs in body [0, 843] Yes
BodyAdjective # of adjectives in body [0, 2669] Yes
HomeworkInTitle Is homework related? [0, 1] Yes
HomeworkInBody Is homework related? [0, 1] Yes
SentimentTitle Sentiment of title text [-2.26, 2.19] Yes
SentimentBody Sentiment of body text [-379.12, 2285.32] Yes
MembershipStatus Milestone status [1, 4] Yes
Day Day of week [0, 6] Yes
Month Month of year [0, 11] Yes
Year Year [2008, 2012] Yes
TagPopularity Tag popularity [1, 3] Yes

Data Exploration

Open Status

Currently, our OpenStatus variable is a factor with five levels (Open, Too localized, Not constructive, Not a real question, Off topic). Since we are dealing with a binary classification problem, let’s turn it into a variable with two levels (0 = closed, 1 = open). To be clear, a question marked as closed means that it fell into one of the four negative categories listed above; it does not mean that the question was answered and then marked as closed. Furthermore, a question marked as open may or may not have an answer, but it remains open because of its potential usefulness to the Stack Overflow community.

# Collapse the five-level OpenStatus feature into a binary variable (0 = closed, 1 = open)
train[, OpenStatus := ( ifelse(PostClosedDate == "", 1, 0))]
# View summary statistics for this variable
table(train$OpenStatus)
0     1 
89176 89176

Reputation at Post Creation

The first thing that caught our attention was the ReputationAtPostCreation variable. Let’s see what the distribution of reputation looks like:

ggplot(train, aes(x=ReputationAtPostCreation)) + 
  geom_density() +
  labs(x = 'User Reputation Points', y = 'Density', title = 'Reputation Density Plot') +
  theme(plot.title = element_text(hjust = 0.5))

It looks like a clear majority of users who ask questions have reputations below 200. Since this data is so skewed, it could be worth log transforming it to obtain something closer to a normal distribution, but let’s do something else: let’s create a new feature that encapsulates the user’s experience. We’ll do this by using the Stack Overflow reputation milestones as our guide.

train$MembershipStatus[train$ReputationAtPostCreation >= 20000] <- 'Trusted'
train$MembershipStatus[train$ReputationAtPostCreation < 20000 & train$ReputationAtPostCreation >= 1000] <- 'Established'
train$MembershipStatus[train$ReputationAtPostCreation < 1000 & train$ReputationAtPostCreation >= 200] <- 'Avid'
train$MembershipStatus[train$ReputationAtPostCreation < 200] <- 'New'

Okay, cool. We’ve created a new feature called MembershipStatus. Let’s see how it relates to our outcome variable (OpenStatus).

ggplot(train, aes(x = MembershipStatus, fill = factor(OpenStatus))) +
  geom_bar(stat='count', position='dodge') +
  labs(x = 'StackOverflow Status', y = 'Count', title = 'Number of Open and Closed Posts by User Status') +
  guides(fill=guide_legend(title='Open Status')) +
  scale_fill_hue(labels=c('Closed', 'Open')) +
  theme_few() +
  theme(plot.title = element_text(hjust = 0.5))

As expected, new users have more questions closed than avid users, avid users have more questions closed than established users, and established users have more questions closed than trusted users.
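
Raw counts are dominated by the sheer number of new users, so as a quick sanity check we can also look at the proportion of posts closed within each group:

# Proportion of posts closed within each membership status group
train[, .(posts = .N, closed_rate = round(mean(OpenStatus == 0), 3)), by = MembershipStatus]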

Post Creation Date

Currently, the PostCreationDate feature is a full year-month-day timestamp. This isn’t very useful on its own, so let’s split this variable into three separate features and see how they might relate to the outcome variable.

Day

dates <- gsub( " .*$", "", train$PostCreationDate)
# Convert date format to day of week
train$Day <- weekdays(as.Date(dates, format = "%Y-%m-%d"))
# Order days of week appropriately
train$Day <- factor(train$Day, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"), ordered = TRUE)
ggplot(train, aes(x = Day, fill = factor(OpenStatus))) +
  geom_bar(stat='count', position='dodge') +
  labs(x = 'Day of Week', y = 'Count', title = 'Number of Open and Closed Posts by Day') +
  guides(fill=guide_legend(title='Open Status')) +
  scale_fill_hue(labels=c('Closed', 'Open')) +
  theme_few() +
  theme(plot.title = element_text(hjust = 0.5))

It looks like there is a slight penalty for asking questions on the weekend. Let’s create a new feature called isWeekend to represent this. Note that, due to time constraints, we did not end up using this feature in our final analysis.

train[, isWeekend := ( ifelse(Day == 'Saturday' | Day == 'Sunday', 1, 0))]

Month

# Convert date format to months
train$Month <- months(as.Date(dates, format = "%Y-%m-%d"))
# Order months appropriately
train$Month <- factor(train$Month, levels = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"), ordered = TRUE)
ggplot(train, aes(x = Month, fill = factor(OpenStatus))) +
  geom_bar(stat='count', position='dodge') +
  labs(x = 'Month', y = 'Count', title = 'Number of Open and Closed Posts by Month') +
  guides(fill=guide_legend(title='Open Status')) +
  scale_fill_hue(labels=c('Closed', 'Open')) +
  theme_few() +
  theme(plot.title = element_text(hjust = 0.5))

An interesting trend appears between July and October: more questions get asked, and more questions get closed. Perhaps this is due to the beginning of the fall semester?

Year

# Convert date format to years
train$Year <- year(as.Date(dates, format = "%Y-%m-%d"))
ggplot(train, aes(x = Year, fill = factor(OpenStatus))) +
  geom_bar(stat='count', position='dodge') +
  labs(x = 'Year', y = 'Count', title = 'Number of Open and Closed Posts by Year') +
  guides(fill=guide_legend(title='Open Status')) +
  scale_fill_hue(labels=c('Closed', 'Open')) +
  theme_few() +
  theme(plot.title = element_text(hjust = 0.5))

As the popularity of Stack Overflow has increased, more people are asking questions and more questions are being closed.

Tags

There are several different tags within the Tag1, Tag2, Tag3, Tag4, and Tag5 columns. Let’s look at how many there are, as well as their distribution.

# Grab the five tag columns
tag_counts <- train[,10:14]
# Convert to vector of tags
aggregated_counts <- as.vector(as.matrix(tag_counts[,c('Tag1', 'Tag2', 'Tag3', 'Tag4', 'Tag5')]))
# Sort tags based on occurrence
aggregated_counts <- as.data.table(sort(table(aggregated_counts), decreasing = TRUE))
# Remove row corresponding to the empty tag
aggregated_counts <- aggregated_counts[-1,]

Let’s see which programming languages are the most popular. We’ll look at the top 25.

ggplot(aggregated_counts[1:25], aes(x = reorder(aggregated_counts, -N), y = N, fill = aggregated_counts)) + 
  geom_bar(stat='identity', position = 'dodge') +
  labs(x='Tag', y='Count', title = 'Top 25 Most Popular Tags', fill = 'Tag Name') +
  theme(axis.text.x=element_text(angle=90, hjust=1)) +
  theme(plot.title = element_text(hjust = 0.5))

It looks like php wins. Now, let’s look at summary statistics for the aggregated_counts data.table:

summary(aggregated_counts)
 aggregated_counts  N           
 Length:20050       Min.   :    1.00  
 Class :character   1st Qu.:    1.00  
 Mode  :character   Median :    3.00  
                    Mean   :   24.41  
                    3rd Qu.:    8.00  
                    Max.   :16829.00

We know from the previous project update that there are numerous singleton tags used in this dataset. Let’s remove them and only consider tags that have been used more than 25 times before proceeding.

aggregated_counts <- subset(aggregated_counts, N > 25)
aggregated_counts$N <- log(aggregated_counts$N, 10)
summary(aggregated_counts)
 aggregated_counts  N           
 Length:2087        Min.   :1.415  
 Class :character   1st Qu.:1.556  
 Mode  :character   Median :1.748  
                    Mean   :1.880  
                    3rd Qu.:2.093  
                    Max.   :4.226

This looks a little bit better. Let’s see what the distribution looks like:

ggplot(aggregated_counts, aes(x=N)) + 
  geom_density() +
  labs(x='Tag Occurrence', y='Density', title = 'Tag Density Plot') +
  theme(plot.title = element_text(hjust = 0.5))

That looks great. Now let’s try to group tags based on popularity to see how they may relate to the outcome variable. We’ll do this by classifying tags with a count below the mean as Unpopular, tags with a count between the mean and third quartile as Moderate, and tags with a count between the third quartile and max as Popular.

minimum <- min(aggregated_counts$N)
average <- mean(aggregated_counts$N)
third_quartile <- quantile(aggregated_counts$N)["75%"]
maximum <- max(aggregated_counts$N)

breaks <- c(minimum, average, third_quartile, maximum)
labels <- c("Unpopular", "Moderate", "Popular")
bins <- cut(aggregated_counts$N, breaks, include.lowest = T, right=FALSE, labels=labels)

aggregated_counts$TagPopularity <- bins
# Rename the tag column so we can join on Tag1
names(aggregated_counts)[names(aggregated_counts) == "aggregated_counts"] <- "Tag1"
# Join the popularity buckets onto the training data by Tag1
setDT(train)[aggregated_counts, TagPopularity := i.TagPopularity, on = .(Tag1)]
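
Note that posts whose Tag1 fell below the 25-use cutoff do not receive a popularity bucket from this join, so it is worth checking how the new feature is populated:

# Posts whose Tag1 was a rare (filtered-out) tag are left as NA
table(train$TagPopularity, useNA = "ifany")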

Owner Creation Date

We can get some interesting information from the OwnerCreationDate feature. Let’s use the year the user created an account to get his or her account age on Stack Overflow.

birthdayDates <- gsub( " .*$", "", train$OwnerCreationDate)
birth <- year(as.Date(birthdayDates, format = "%Y-%m-%d"))
age <- 2012 - birth
train$Age <- age
ggplot(train, aes(x = Age, fill = factor(OpenStatus))) +
     geom_bar(stat='count', position='dodge') +
     labs(x = 'Account Age (in years)', y = 'Count', title = 'Number of Open and Closed Posts by Account Age') +
     guides(fill=guide_legend(title='Open Status')) +
     scale_fill_hue(labels=c('Closed', 'Open')) +
     theme_few() +
     theme(plot.title = element_text(hjust = 0.5))

Parts of Speech and Sentiment

As mentioned previously, we also engineered features related to the parts of speech and sentiment of a post. In this context, parts of speech refer to the number of nouns, verbs, adverbs, and adjectives in the title and body of a post. Sentiment refers to the tone or attitude expressed in the title or body of the post, where a score below zero is considered negative, a score near zero neutral, and a score above zero positive. Let’s get an idea of how these features relate to the outcome variable:

library(corrplot)
# Correlations are only defined for numeric columns, so restrict to those
num_cols <- names(train)[sapply(train, is.numeric)]
cor_data <- cor(train[, num_cols, with = FALSE])
corrplot(cor_data, method = "square")

Cool, it looks like the parts of speech features we engineered have a slight to moderate correlation with our outcome variable. Unfortunately, the sentiment features we engineered appear to have a negative correlation, so we will not pursue them further in our data exploration. Let’s take a closer look at the parts of speech features.

BodyLength and TitleLength

b_words <- ggplot(train, aes(x = MembershipStatus, y = NumberOfWordsInBody, fill = factor(OpenStatus))) +
  stat_summary(fun.y="mean", geom="bar", position='dodge') + 
  labs(x = 'StackOverflow Status', y = 'Mean Number of Words in Body', title = 'Avg(Words) in Body Based on User Status') +
  guides(fill=guide_legend(title='Open Status')) +
  scale_fill_hue(labels=c('Closed', 'Open')) +
  theme_few() +
  theme(plot.title = element_text(hjust = 0.5))
  
t_words <- ggplot(train, aes(x = MembershipStatus, y = NumberOfWordsInTitle, fill = factor(OpenStatus))) +
  stat_summary(fun.y="mean", geom="bar", position='dodge') + 
  labs(x = 'StackOverflow Status', y = 'Mean Number of Words in Title', title = 'Avg(Words) in Title Based on User Status') +
  guides(fill=guide_legend(title='Open Status')) +
  scale_fill_hue(labels=c('Closed', 'Open')) +
  theme_few() +
  theme(plot.title = element_text(hjust = 0.5))
  
library(gridExtra)
grid.arrange(b_words, t_words, ncol = 2)

BodyNouns and TitleNouns

b_nouns <- ggplot(train, aes(x = MembershipStatus, y = BodyNoun, fill = factor(OpenStatus))) +
  stat_summary(fun.y="mean", geom="bar", position='dodge') + 
  labs(x = 'StackOverflow Status', y = 'Mean Number of Nouns in Body', title = 'Avg(Nouns) in Body Based on User Status') +
  guides(fill=guide_legend(title='Open Status')) +
  scale_fill_hue(labels=c('Closed', 'Open')) +
  theme_few() +
  theme(plot.title = element_text(hjust = 0.5))
  
t_nouns <- ggplot(train, aes(x = MembershipStatus, y = TitleNoun, fill = factor(OpenStatus))) +
  stat_summary(fun.y="mean", geom="bar", position='dodge') + 
  labs(x = 'StackOverflow Status', y = 'Mean Number of Nouns in Title', title = 'Avg(Nouns) in Title Based on User Status') +
  guides(fill=guide_legend(title='Open Status')) +
  scale_fill_hue(labels=c('Closed', 'Open')) +
  theme_few() +
  theme(plot.title = element_text(hjust = 0.5))
  
grid.arrange(b_nouns, t_nouns, ncol = 2)

Removing Collinear Features

Before building our model, we should remove features that are collinear, as they can affect the performance of our model. Let’s identify collinear features using a 75% correlation threshold and remove them:

library(caret)
result <- findCorrelation(cor_data, cutoff = 0.75)
cor_data[, result]
library(knitr)
library(data.table)
corr_tbl <- data.table(read.csv('corr.csv', sep = '\t', stringsAsFactors = T))
kable(corr_tbl)
X NumberOfWordsInBody BodyNoun BodyAdjective ReputationAtPostCreation
ReputationAtPostCreation 0.0145523 -0.0045167 0.0032893 1.0000000
TimeBetweenJoiningAndPosting 0.0186692 -0.0049861 -0.0040282 0.2998619
NumberOfWordsInTitle 0.0422945 0.0191837 0.0191390 0.0270069
NumberOfQuestionsInTitle -0.0587115 -0.0728418 -0.0498797 0.0739489
NumberOfQuestionsInBody 0.2312444 0.2017594 0.2241463 0.0220877
TagsContainHomework 0.0356223 0.0292592 0.0273377 -0.0084125
PostBodyContainsCodeFragment 0.0441115 0.0499384 0.0543177 -0.0039682
TitleNoun 0.0649369 0.0543866 0.0281176 -0.0056978
TitleAdverb 0.0274721 0.0242831 0.0257428 0.0162788
TitleVerb 0.0324711 0.0199504 0.0193019 0.0273074
TitleAdjective 0.0309499 0.0036676 0.0365767 0.0302319
BodyNoun 0.8950727 1.0000000 0.7921082 -0.0045167
BodyAdverb 0.7330888 0.8121151 0.7610987 -0.0009667
BodyVerb 0.8079848 0.6226818 0.5231279 0.0343449
BodyAdjective 0.7487115 0.7921082 1.0000000 0.0032893
HomeworkInTitle 0.0120184 0.0043175 0.0056380 -0.0017060
HomeworkInBody 0.0657162 0.0429439 0.0387880 -0.0077251
SentimentTitle -0.0623379 -0.0878986 -0.0626156 0.0108036
SentimentBody 0.0045638 0.0077567 0.0026799 -0.0007069
Year 0.0442817 0.0597827 0.0441286 -0.0358436
NumberOfTags 0.1322587 0.0849063 0.0651901 0.0285871
DayOfWeek 0.0038018 0.0012470 0.0070569 -0.0034225
MonthNumber -0.0033319 -0.0028840 0.0007746 -0.0037980
MembershipStatus 0.0001896 -0.0210451 -0.0112578 0.2736178
PopularityNumber -0.0252074 -0.0291668 -0.0291960 0.0031493
OwnerUndeletedAnswerCountAtPostTime 0.0188627 -0.0029363 0.0052784 0.9117019
NumberOfWordsInBody 1.0000000 0.8950727 0.7487115 0.0145523
OpenStatus 0.0988113 0.0800382 0.0609730 0.0615442

We can see that OwnerUndeletedAnswerCountAtPostTime is highly correlated with ReputationAtPostCreation, and that NumberOfWordsInBody is highly correlated with both BodyNoun and BodyVerb. Let’s remove the OwnerUndeletedAnswerCountAtPostTime and NumberOfWordsInBody features.

train$OwnerUndeletedAnswerCountAtPostTime <- NULL
train$NumberOfWordsInBody <- NULL

Results

Model

Now that we have engineered several features, we can create a model to classify posts based on them. First, we want to partition off some of our data as a final test set. We used a 70/30 split: the model is trained on 70% of the data, and we then see how well it performs on the remaining 30%.

# Keep a copy of the full dataset and draw a random 70% sample of row indices
traindata <- train
t_idx <- sample(seq_len(nrow(traindata)), size = floor(0.7 * nrow(traindata)))
# 70% to train
train <- traindata[t_idx, ]
# 30% to test
test <- traindata[-t_idx, ]
nrow(train)
nrow(test)
# Training size
124846
# Test size
53506

Now, let’s build our model using the randomForest package in R. First, let’s do a vanilla run to benchmark our model. We build the model with an ntree value of 500, which specifies the number of trees to grow, and set importance to TRUE so that we get the resulting feature importance statistics.

rf <- randomForest(formula = as.factor(OpenStatus) ~ ., data = train, importance = T, ntree = 500)
rf
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 5
OOB estimate of  error rate: 29.48%

Confusion matrix:
      0     1   class.error
0 43873 18709   0.2989518
1 18095 44169   0.2906174

Not bad. We received an out-of-bag (OOB) error of 29.48%, which, put very simply, is an estimate of the prediction error of our model. We can also see from the confusion matrix how well we classify observations at each of the two levels of our outcome variable. It looks like we are roughly equally good at predicting open and closed posts. Let’s look at this graphically:

plot(rf, ylim=c(0, 0.40))

From this graph, we see exactly what was shown in the confusion matrix. The black line is our OOB error (29.48%), the red line is our error rate for closed questions (29.89%), and the green line is our error rate for open questions (29.06%).

We can also see how the number of trees used to construct the classifier impacts the resulting error rate. As the number of trees approaches 500, the error curves begin to flatten out, which tells us that (1) 500 was a reasonable value for the number of trees and (2) increasing the number of trees past 500 is unlikely to improve the error rate.
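
The same information can also be read directly from the fitted object: err.rate stores the cumulative OOB and per-class error rates after each tree is added.

# Cumulative error rates: columns are OOB and the per-class errors for 0 (closed) and 1 (open)
tail(rf$err.rate, 3)
# Smallest OOB error reached anywhere along the run
min(rf$err.rate[, "OOB"])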

Let’s now look at the importance of each feature:

varImpPlot(rf)

From these graphs, we can see that the top features are:

  • BodyNoun
  • SentimentBody
  • BodyAdverb
  • BodyVerb
  • ReputationAtPostCreation
  • Year

We also made several efforts to improve the accuracy of our model. Our first method involved tweaking the input parameters to the randomForest algorithm. The parameters used included:

  • mtry (Number of features tried at each node split)
  • nodesize (Minimum node size)
  • ntree (Number of trees to grow)

We found that an mtry value of 5, a nodesize of 12, and an ntree value of 500 provided marginally better results than our benchmark.
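
We did not run an exhaustive search, but a small grid over candidate values, sketched below, is one way to compare settings using the OOB error (ntree is kept low here just to make the search affordable):

# Sketch of a small parameter grid search; not necessarily the exact procedure we used
params <- expand.grid(mtry = c(3, 5, 7), nodesize = c(5, 12, 25))
oob <- apply(params, 1, function(p) {
  fit <- randomForest(as.factor(OpenStatus) ~ ., data = train,
                      ntree = 200, mtry = p["mtry"], nodesize = p["nodesize"])
  # OOB error after the final tree
  fit$err.rate[nrow(fit$err.rate), "OOB"]
})
params[which.min(oob), ]

Refitting with the chosen values: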

rf_best <- randomForest(formula = as.factor(OpenStatus) ~ ., data = train, importance = T, ntree = 500, mtry = 5, nodesize = 12)
rf_best
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 5
OOB estimate of  error rate: 29.35%

Confusion matrix:
      0     1   class.error
0 43900 18682   0.2985203
1 17966 44298   0.2885455

From this output, we can see a slight improvement in our OOB error (29.35%), little to no improvement in our error rate for closed questions (29.85%), and a slight improvement in our error rate for open questions (28.85%). Let’s see if the variable importance statistics changed at all:

varImpPlot(rf_best)

It does not look like much has changed: the Year feature climbed a few ranks in the MeanDecreaseGini plot, but otherwise the ordering is largely the same after tuning the parameters.

Our second attempt at improving our model involved using a smaller number of features. According to Han, Guo, and Yu [6], less important features can hurt model performance when using random forest methods. Thus, they suggest the following algorithm:

In Step 1:

  • Run the random forest algorithm
  • Rank every feature using the MeanDecreaseAccuracy (MDA) and MeanDecreaseGini (MDG)

In Step 2:

  • Subset the top 50 percent of highest scoring features
  • Run random forest using this subset of features and inspect the error rate

If the error rate decreases after an iteration, we repeat the procedure, and we keep doing so until the error rate increases or there are no more features left [6]. A rough sketch of how one such iteration could be automated is shown below, followed by our actual run of the algorithm using the same parameters as our previous model.
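
The sketch below is only illustrative; it assumes the tuned rf_best model from above (fit with importance = T), ranks the features by combining their MeanDecreaseAccuracy and MeanDecreaseGini ranks, and refits on the top half.

# Illustrative sketch of one iteration of the rank-and-subset procedure
imp <- importance(rf_best)
# Combine the two importance rankings (a lower combined rank means a more important feature)
ranks <- rank(-imp[, "MeanDecreaseAccuracy"]) + rank(-imp[, "MeanDecreaseGini"])
top_half <- names(sort(ranks))[1:ceiling(length(ranks) / 2)]
fmla <- as.formula(paste("as.factor(OpenStatus) ~", paste(top_half, collapse = " + ")))
rf_subset <- randomForest(fmla, data = train, importance = T,
                          ntree = 500, mtry = 5, nodesize = 12)
# OOB error after all 500 trees
rf_subset$err.rate[nrow(rf_subset$err.rate), "OOB"]

Our actual run on the top half of the features follows: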

rf_han_guo_yu = randomForest(as.factor(OpenStatus) ~ BodyNoun + BodyAdverb + SentimentBody
                  + BodyVerb + ReputationAtPostCreation + Year 
                  + TimeBetweenJoiningAndPosting + SentimentTitle + BodyAdjective
                  + MonthNumber + NumberOfWordsInTitle + TitleNoun + DayOfWeek,
                  data = train,
                  importance = T,
                  ntree = 500,
                  mtry = 5,
                  nodesize = 12)
rf_han_guo_yu
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 5
OOB estimate of  error rate: 30.5%

Confusion matrix:
      0     1   class.error
0 43236 19346   0.3091304
1 18738 43526   0.3009444

Oh, no! Our error rate increased, going from 29.35% when considering all features to 30.50% when considering only half of them. We’ll stop here, as removing any more features will likely continue to increase the error rate.

Prediction

Earlier, we split our data into 70% training and 30% test partitions and used the 70% to train our model. Now let’s make predictions on the remaining 30% of held-out test data.

# Drop the outcome column (OpenStatus, originally column 26) before predicting
prediction <- predict(rf_best, newdata = test[, !"OpenStatus"])
prediction_table = table(prediction, test$OpenStatus);
prediction_table
prediction_table      0     1
                0 18552  7804
                1  8042 19108

One question we may want to ask about this confusion matrix is how precise our classifier is. Precision is the proportion of posts we predicted as open (our positive class) that are actually open. We can calculate it using a simple formula: \[P = \frac{TP}{TP + FP}\]

precision = prediction_table[2,2]/(sum(prediction_table[2,1:2]))
precision
0.7037937

How about recall? Recall is the proportion of truly open posts that we correctly identified as open. We can calculate it using the following formula: \[R = \frac{TP}{TP + FN}\]

recall = prediction_table[2,2]/(sum(prediction_table[1:2,2]))
recall
0.7100178

Let’s also look at the F-score, which combines precision and recall and can be calculated as follows: \[F = 2\frac{P \cdot R}{P + R}\]

f1Score = (2*precision*recall)/(precision+recall)
f1Score
0.7068921

We can also look at the overall model accuracy by considering the number of correct predictions over all predictions: \[A = \frac{TP+TN}{TP+TN+FP+FN}\]

accuracy = (prediction_table[2,2] + prediction_table[1,1])/(sum(prediction_table[1:2,1:2]))
accuracy
0.7038463
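
For convenience, the four calculations above can be wrapped into a small helper function. This assumes the layout of prediction_table shown above: predictions in rows, actual labels in columns, with open (1) treated as the positive class.

# Precision, recall, F1, and accuracy from a 2x2 prediction table
classification_metrics <- function(tbl) {
  tn <- tbl[1, 1]; fn <- tbl[1, 2]
  fp <- tbl[2, 1]; tp <- tbl[2, 2]
  p <- tp / (tp + fp)
  r <- tp / (tp + fn)
  c(precision = p,
    recall = r,
    f1 = 2 * p * r / (p + r),
    accuracy = (tp + tn) / (tp + tn + fp + fn))
}
classification_metrics(prediction_table)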

Discussion

RQ1: What are the main features that indicate whether a new post will remain open or become closed? As seen in the model results, the major factors are the composition and sentiment of the post body. These were the most important features in determining whether a post will remain open or be closed. Unfortunately, some of the variables we had hoped would be strong features, such as the presence of homework terms in the title or body, membership status, and whether the post contained code fragments, turned out not to be very important.

RQ2: Why do these features play the roles they play? When considering post composition, the number of nouns, verbs, and so on make sense as key factors: a post that is too long or too short can hamper its ability to convey a meaningful question. The sentiment of the post can also play a key role. If a post carries a negative sentiment, it may convey an unintended message and be marked for closure. However, even a positively worded post can be closed if it doesn’t contain enough content.

As for the features that didn’t play major roles, we believe many of them were weak simply because they occur so rarely in the data. Our dataset contained only approximately one thousand posts with code fragments, which is less than one percent of the data overall. Furthermore, some students do not seem to use common keywords when posting homework questions: our dataset had a little fewer than 3,000 posts with homework-related terms in the body, and fewer than 300 in the title.
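
These counts can be checked directly with a quick tally over the training partition (the exact numbers depend on which split is loaded):

# How often the binary indicator features actually fire
colSums(train[, .(PostBodyContainsCodeFragment, TagsContainHomework,
                  HomeworkInTitle, HomeworkInBody)])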

Future Work

Refine our features to detect code fragments that are not wrapped in code tags. From this we could determine not only the impact of the different ways of posting code, but also whether having code in a post plays a key role at all.

Determine how effective the parts-of-speech breakdown is compared with simply using the number of words in the body and title. As we saw, most of the key features were related to the composition of the post body. Collapsing these back into a single word count may reveal how useful the breakdown really was in determining post status.

Finally, our analysis of both sentiment and parts of speech included code fragments. Excluding code from these features could further refine how much of an impact they have.

Conclusion

In this study, we examined several features of Stack Overflow questions and created a model that can predict the open status of a question with 70% accuracy. We also identified several key features of a Stack Overflow post that play a role in its open status. Our results are comparable to previous work [7] in this domain, but there is still room for improvement. We utilized the random forest algorithm to build our model and make predictions; however, it could be beneficial to examine other models, such as support vector machines (SVM) or gradient boosted trees (GBT), to see whether they offer any additional improvement in classification.

References

[1] Stack Overflow. (n.d.). Retrieved May 2, 2017, from https://stackoverflow.com/

[2] Reddit. (n.d.). Retrieved May 2, 2017, from https://www.reddit.com

[3] Biostars. (n.d.). Retrieved May 2, 2017, from https://www.biostars.org

[4] Quora. (n.d.). Retrieved May 2, 2017, from https://www.quora.com

[5] Kaggle. (n.d.). Retrieved May 2, 2017, from https://www.kaggle.com/c/predict-closed-questions-on-stack-overflow

[6] Han, H., Guo, X., & Yu, H. (2016, August). Variable selection using Mean Decrease Accuracy and Mean Decrease Gini based on Random Forest. In Software Engineering and Service Science (ICSESS), 2016 7th IEEE International Conference on (pp. 219-224). IEEE.

[7] Correa, D., & Sureka, A. (2013, October). Fit or unfit: analysis and prediction of 'closed questions' on Stack Overflow. In Proceedings of the First ACM Conference on Online Social Networks (pp. 201-212). ACM.

[8] Correa, D., & Sureka, A. (2014, April). Chaff from the wheat: characterization and modeling of deleted questions on stack overflow. In Proceedings of the 23rd international conference on World wide web (pp. 631-642). ACM.

[9] Xia, X., Lo, D., Correa, D., Sureka, A., & Shihab, E. (2016, June). It takes two to tango: Deleted stack overflow question prediction with text and meta features. In Computer Software and Applications Conference (COMPSAC), 2016 IEEE 40th Annual (Vol. 1, pp. 73-82). IEEE.

[10] Ponzanelli, L., Mocci, A., Bacchelli, A., Lanza, M., & Fullerton, D. (2014, September). Improving low quality stack overflow post detection. In Software Maintenance and Evolution (ICSME), 2014 IEEE International Conference on (pp. 541-544). IEEE.

[11] Ponzanelli, L., Mocci, A., Bacchelli, A., & Lanza, M. (2014, October). Understanding and classifying the quality of technical forum questions. In Quality Software (QSIC), 2014 14th International Conference on (pp. 343-352). IEEE.