Introduction. Stack Overflow is a question and answer site that was founded in 2008. With over 40 million unique visitors each month, it remains one of the premier resources for students, researchers, and industry professionals seeking answers to software engineering questions [1]. Like other QA sites [2, 3, 4], Stack Overflow uses a crowdsourcing framework for content production and site moderation. Stack Overflow strives to provide the software engineering community with the highest quality questions and answers to common and complex programming tasks. This is accomplished through a ladder system in which users who ask good questions are awarded points, while users who ask questions classified as (1) off topic, (2) not constructive, (3) not a real question, or (4) too localized lose points and risk having their question closed. The same applies to answers. Multiple answers can be given for a single question, but only the best answer will be marked as such. With 6% of all Stack Overflow questions being marked as closed [5], a natural question to ask is: why? What features of a Stack Overflow question make it good or bad? Answering these questions is the primary goal of this research project.
Motivation. As Stack Overflow is moderated by users, it can take time to close posts that are off topic, not constructive, not a real question, or too localized. These posts also lead to wasted effort, not only by the poster, but also by the moderators and other community members who take the time to read, understand, and answer these questions. To combat the problem of poor quality questions, questions could be tagged and closed before any human effort is spent on them. To do this, we can use the metadata of the poster and the question itself to build a classifier that tags these questions appropriately.
Problem Statement: Build a classifier that predicts whether a question will remain open or closed based on different features of a Stack Overflow question. The original problem statement and datasets used can be found on Kaggle.
RQ1: What are the main features that indicate whether a new post will remain open or become closed?
RQ2: Why do these features play the roles they play?
The dataset used to investigate our research questions was a 189-megabyte (MB) training dataset called train-sample_October_9_2012_v2.csv. This dataset contains 178,352 observations with a date range from August 2008 to October 2012. The main difference between this dataset and the dataset presented during our previous project update is that the current dataset contains an equal number of open and closed posts.
As before, let’s see what we have to work with:
library(ggplot2)
library(ggthemes)
library(corrplot)
library(scales)
library(dplyr)
library(data.table)
# Read in data
train <- data.table(read.csv('train_October_9_2012.csv', stringsAsFactors = F))
#Summarize content in data table
str(train)
The str function gives us a compact overview of the structure of our data.table: the name and type of each feature, along with its first few values.
Classes ‘data.table’ and 'data.frame': 178352 obs. of 36 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ PostId : int 10035653 8922537 5962216 10070625 ...
$ PostCreationDate : chr "2012-04-05 20:37:35" "2012-01-19 07:38:27" ...
$ OwnerUserId : int 1159226 1157921 696219 490895 ...
$ OwnerCreationDate : chr "2012-01-19 18:46:16" "2012-01-19 07:31:34" ...
$ ReputationAtPostCreation : int 1 1 40 1 28 50 10 1 2422 38 ...
$ OwnerUndeletedAnswerCountAtPostTime: int 0 0 2 1 0 13 0 0 91 0 ...
$ Title : chr "what is the best way to connect my" ...
$ BodyMarkdown : chr "I know this question can be answered by" ...
$ Tag1 : chr "c++" "php" "iphone-sdk-4.0" "linux" ...
$ Tag2 : chr NA "xml" NA "module" ...
$ Tag3 : chr NA "cakephp" NA "kernel" ...
$ Tag4 : chr NA "zip" NA NA ...
$ Tag5 : chr NA NA NA NA ...
$ PostClosedDate : chr "2012-04-05 23:31:34" "2012-01-19" ...
$ OpenStatus : chr "too localized" "not a real question" "open" ...
$ TimeBetweenJoiningAndPosting : int 11 0 5 75 12 90 1 0 74 2 ...
$ NumberOfWordsInTitle : int 11 12 5 4 16 9 8 7 7 6 ...
$ NumberOfQuestionsInTitle : int 1 0 0 0 1 0 0 0 1 0 ...
$ NumberOfWordsInBody : int 197 78 86 67 26 69 68 45 72 151 ...
$ NumberOfTags : int 1 4 1 3 2 3 2 2 3 2 ...
$ NumberOfQuestionsInBody : int 3 0 2 1 1 1 1 1 1 1 ...
$ TagsContainHomework : int 0 0 0 0 0 0 0 0 0 0 ...
$ PostBodyContainsCodeFragment : int 0 0 0 0 0 0 0 0 0 0 ...
$ TitleNoun : int 3 5 3 1 6 5 3 2 3 2 ...
$ TitleAdverb : int 0 0 0 0 2 0 0 1 1 0 ...
$ TitleVerb : int 2 2 1 2 4 0 1 1 3 1 ...
$ TitleAdjective : int 1 2 0 0 0 1 0 1 0 1 ...
$ BodyNoun : int 51 27 24 17 13 23 23 12 13 48 ...
$ BodyAdverb : int 21 3 6 3 4 12 10 7 9 25 ...
$ BodyVerb : int 47 16 18 20 5 16 13 9 24 31 ...
$ BodyAdjective : int 14 1 6 4 2 4 4 3 2 6 ...
$ HomeworkInTitle : int 0 0 0 0 0 0 0 0 0 0 ...
$ HomeworkInBody : int 0 0 0 0 0 0 0 0 0 0 ...
$ SentimentTitle : num 0.1508 0.1155 0 0 0.0625 ...
$ SentimentBody : num 0.2242 0.0255 -0.0113 -0.1511 -0.01
From this output, we can see that we have 178,352 observations and 36 variables (a row index, the outcome variable, and the candidate predictors). The features in this dataset include the original features, as well as the features we engineered as a result of the data exploration presented in project update two, using the StanfordCoreNLP and sentimentr libraries.
Below you will find the name of each feature, its function, range of values, and whether or not it was used in the classifier. Note, additional features (i.e., Day, Month, Year) were added to this dataset as a result of further data exploration, which we will show in the data exploration section of this document.
Variable Name | Description | Range | Used in classifier |
---|---|---|---|
PostId | Unique post identifier | [17, 12810064] | No |
PostCreationDate | Post Date | [8-1-08, 10-9-12] | No |
OwnerUserId | Unique id of user | [2, 1766790] | No |
OwnerCreationDate | Account creation date | [7-31-08, 10-9-12] | No |
ReputationAtPostCreation | Reputation of user | [-32, 155182] | No |
OwnerUndeletedAnswerCountAtPostTime | Number of undeleted answers | [0, 5031] | No |
Title | Question title | NA | No |
BodyMarkdown | Question content | NA | No |
Tag1 | Domain of question | NA | No |
Tag2 | Same (optional) | NA | No |
Tag3 | Same (optional) | NA | No |
Tag4 | Same (optional) | NA | No |
Tag5 | Same (optional) | NA | No |
PostClosedDate | Post close date | [8-22-08, 10-25-12] | No |
OpenStatus | Post open status | [0, 1] | Yes |
TimeBetweenJoiningAndPosting | User join date - post date | [0, 218] | Yes |
NumberOfWordsInTitle | # of words in title | [1, 40] | Yes |
NumberOfQuestionsInTitle | # of question marks in title | [0, 6] | Yes |
NumberOfWordsInBody | # of words in body | [1, 12066] | No |
NumberOfQuestionsInBody | # of question marks in body | [0, 357] | Yes |
NumberOfTags | # of tags used | [1, 5] | Yes |
TagsContainHomework | Is homework related? | [0, 1] | Yes |
PostBodyContainsCodeFragment | Body contains markup | [0, 1] | Yes |
TitleNoun | # of nouns in title | [0, 21] | Yes |
TitleAdverb | # of adverbs in title | [0, 14] | Yes |
TitleVerb | # of verbs in title | [0, 10] | Yes |
TitleAdjective | # of adjectives in title | [0, 11] | Yes |
BodyNoun | # of nouns in body | [0, 4518] | Yes |
BodyAdverb | # of adverbs in body | [0, 2999] | Yes |
BodyVerb | # of verbs in body | [0, 843] | Yes |
BodyAdjective | # of adjectives in body | [0, 2669] | Yes |
HomeworkInTitle | Is homework related? | [0, 1] | Yes |
HomeworkInBody | Is homework related? | [0, 1] | Yes |
SentimentTitle | Sentiment of title text | [-2.26, 2.19] | Yes |
SentimentBody | Sentiment of body text | [-379.12, 2285.32] | Yes |
MembershipStatus | Milestone status | [1, 4] | Yes |
Day | Day of week | [0, 6] | Yes |
Month | Month of year | [0, 11] | Yes |
Year | Year | [2008, 2012] | Yes |
TagPopularity | Tag popularity | [1, 3] | Yes |
Currently, our OpenStatus variable has five levels (Open, Too localized, Not constructive, Not a real question, Off topic). Since we are dealing with a classification problem, let’s turn it into a binary variable (0 = closed, 1 = open) with two levels. To be clear, a question that is marked as closed fell into one of the four negative categories listed above. It does not mean that the question was answered and then marked as closed. Furthermore, a question that is marked as open may or may not have an answer, but remains open because of its potential usefulness to the Stack Overflow community.
# Turn OpenStatus from a variable with five levels into a variable with two levels
train[, OpenStatus := ( ifelse(PostClosedDate == "", 1, 0))]
# View summary statistics for this variable
table(train$OpenStatus)
0 1
89176 89176
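The same recoding can also be derived from the status labels themselves rather than from the close date. Here is a minimal sketch of that idea on a small made-up vector; the name `status` and its values are illustrative only, and it assumes open posts carry the lowercase label "open", as in the str output above.

```r
# Toy status labels (hypothetical values mirroring the five levels described above)
status <- c("open", "too localized", "not a real question",
            "off topic", "not constructive", "open")

# 1 = open, 0 = closed for any of the four close reasons
open_binary <- as.integer(status == "open")

table(open_binary)
```

Comparing against a single label keeps the rule explicit: anything that is not "open" counts as closed.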
The first thing that caught our attention was the ReputationAtPostCreation variable. Let’s see what the distributions of reputation are:
ggplot(train, aes(x=ReputationAtPostCreation)) +
geom_density() +
labs(x = 'User Reputation Points', y = 'Density', title = 'Reputation Density Plot') +
theme(plot.title = element_text(hjust = 0.5))
It looks like a clear majority of users who ask questions have reputations below 200. Since this data is so skewed, it could be worth log transforming it to obtain something closer to a normal distribution, but let’s do something else. Let’s create a new feature that encapsulates the user’s experience. We’ll do this by using the Stack Overflow milestones as our guide.
train$MembershipStatus[train$ReputationAtPostCreation >= 20000] <- 'Trusted'
train$MembershipStatus[train$ReputationAtPostCreation < 20000 & train$ReputationAtPostCreation >= 1000] <- 'Established'
train$MembershipStatus[train$ReputationAtPostCreation < 1000 & train$ReputationAtPostCreation >= 200] <- 'Avid'
train$MembershipStatus[train$ReputationAtPostCreation < 200] <- 'New'
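The four assignments above can also be written as a single binning step. The sketch below uses base R's cut() on a made-up `reputation` vector (the values are illustrative, chosen to hit each boundary) with the same thresholds of 200, 1000, and 20000.

```r
# Toy reputation values (hypothetical), including the boundary cases
reputation <- c(-32, 1, 199, 200, 999, 1000, 19999, 20000)

# right = FALSE makes each interval closed on the left:
# (-Inf, 200), [200, 1000), [1000, 20000), [20000, Inf)
membership <- cut(reputation,
                  breaks = c(-Inf, 200, 1000, 20000, Inf),
                  labels = c("New", "Avid", "Established", "Trusted"),
                  right = FALSE)

data.frame(reputation, membership)
```

Because cut() returns an ordered set of factor levels, as.integer(membership) yields codes from 1 to 4, which would be one way to arrive at the [1, 4] range listed for MembershipStatus in the feature table.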
Okay, cool. We’ve created a new feature called MembershipStatus. Let’s see how it relates to our outcome variable (OpenStatus).
ggplot(train, aes(x = MembershipStatus, fill = factor(OpenStatus))) +
geom_bar(stat='count', position='dodge') +
labs(x = 'StackOverflow Status', y = 'Count', title = 'Number of Open and Closed Posts by User Status') +
guides(fill=guide_legend(title='Open Status')) +
scale_fill_hue(labels=c('Closed', 'Open')) +
theme_few() +
theme(plot.title = element_text(hjust = 0.5))
As expected, new users have more questions closed than avid users, avid users have more questions closed than established users, and established users have more questions closed than trusted users.
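Raw counts depend on how many posts each group contributes, so it can help to look at the share of closed posts within each group as well. Here is a small sketch on made-up data (the `toy` data frame is hypothetical and far smaller than the real dataset):

```r
# Toy data: one row per post (hypothetical values)
toy <- data.frame(
  MembershipStatus = c("New", "New", "New", "New", "Avid", "Avid", "Trusted", "Trusted"),
  OpenStatus       = c(0, 0, 0, 1, 0, 1, 1, 1)
)

# Cross-tabulate status by outcome, then normalise each row to sum to 1
counts <- table(toy$MembershipStatus, toy$OpenStatus)
closed_share <- prop.table(counts, margin = 1)[, "0"]

round(closed_share, 2)
```

With margin = 1 each row is divided by its own total, so the groups become comparable even when one of them is much larger than the others.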
Currently, the PostCreationDate feature is a timestamp in year-month-day format followed by the time of day. This isn’t very useful on its own, so let’s split this variable into three separate features and see how they might relate to the outcome variable.
dates <- gsub( " .*$", "", train$PostCreationDate)
# Convert date format to day of week (dates look like "2012-04-05" in the str output)
train$Day <- weekdays(as.Date(dates, format = "%Y-%m-%d"))
# Order days of week appropriately
train$Day <- factor(train$Day, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"), ordered = TRUE)
ggplot(train, aes(x = Day, fill = factor(OpenStatus))) +
geom_bar(stat='count', position='dodge') +
labs(x = 'Day of Week', y = 'Count', title = 'Number of Open and Closed Posts by Day') +
guides(fill=guide_legend(title='Open Status')) +
scale_fill_hue(labels=c('Closed', 'Open')) +
theme_few() +
theme(plot.title = element_text(hjust = 0.5))
It looks like there is a slight penalty for asking questions on the weekends. Let’s create a new feature called isWeekend to represent this. Note that, because of time constraints, we did not end up using this feature in our final analysis.
train[, isWeekend := ( ifelse(Day == 'Saturday' | Day == 'Sunday', 1, 0))]
# Convert date format to months
train$Month <- months(as.Date(dates, format = "%Y-%m-%d"))
# Order months appropriately
train$Month <- factor(train$Month, levels = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"), ordered = TRUE)
ggplot(train, aes(x = Month, fill = factor(OpenStatus))) +
geom_bar(stat='count', position='dodge') +
labs(x = 'Month', y = 'Count', title = 'Number of Open and Closed Posts by Month') +
guides(fill=guide_legend(title='Open Status')) +
scale_fill_hue(labels=c('Closed', 'Open')) +
theme_few() +
theme(plot.title = element_text(hjust = 0.5))
An interesting trend is shown between the months of July, August, September and October. More questions get asked and more questions get closed. Perhaps this is due to the beginning of the Fall semester?
# Convert date format to years
train$Year <- year(as.Date(dates, format = "%Y-%m-%d"))
ggplot(train, aes(x = Year, fill = factor(OpenStatus))) +
geom_bar(stat='count', position='dodge') +
labs(x = 'Year', y = 'Count', title = 'Number of Open and Closed Posts by Year') +
guides(fill=guide_legend(title='Open Status')) +
scale_fill_hue(labels=c('Closed', 'Open')) +
theme_few() +
theme(plot.title = element_text(hjust = 0.5))
As the popularity of Stack Overflow has increased, more people are asking questions and more questions are being closed.
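The day, month, and year features can also be extracted as numbers in one pass. The sketch below assumes timestamps in the format shown in the str output; the first two values are taken from that output and the third is a made-up Saturday added to exercise the weekend flag.

```r
# Timestamps in the format shown in the str() output above
stamps <- c("2012-04-05 20:37:35", "2012-01-19 07:38:27", "2012-04-07 09:00:00")

# Drop the time of day, then let POSIXlt expose the calendar fields
parts <- as.POSIXlt(as.Date(substr(stamps, 1, 10), format = "%Y-%m-%d"))

features <- data.frame(
  Day       = parts$wday,          # 0 = Sunday ... 6 = Saturday
  Month     = parts$mon,           # 0 = January ... 11 = December
  Year      = parts$year + 1900,   # POSIXlt stores years since 1900
  isWeekend = as.integer(parts$wday %in% c(0, 6))
)

features
```

These conventions line up with the [0, 6] and [0, 11] ranges listed for Day and Month in the feature table, although that is only a guess at how the numeric versions were produced. Unlike weekdays() and months(), the numeric fields do not depend on the locale.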
We can get some interesting information from the OwnerCreationDate feature. Let’s use the year the user created an account to get his or her account age on Stack Overflow.
birthdayDates <- gsub( " .*$", "", train$OwnerCreationDate)
birth <- year(as.Date(birthdayDates, format = "%Y-%m-%d"))
# Subtraction is vectorized, so no sapply is needed
age <- 2012 - birth
train$Age <- age
ggplot(train, aes(x = Age, fill = factor(OpenStatus))) +
geom_bar(stat='count', position='dodge') +
labs(x = 'Account Age (in years)', y = 'Count', title = 'Number of Open and Closed Posts by Account Age') +
guides(fill=guide_legend(title='Open Status')) +
scale_fill_hue(labels=c('Closed', 'Open')) +
theme_few() +
theme(plot.title = element_text(hjust = 0.5))
As mentioned previously, we also engineered features related to the parts of speech and sentiment of a post. In this context, parts of speech refer to the number of nouns, verbs, adverbs, and adjectives in the title and body of a post. Sentiment refers to the tone or attitude of the post as represented in its title or body. It can take on negative or positive values: a sentiment less than zero is considered negative, a sentiment near zero is neutral, and a sentiment greater than zero is positive. Let’s get an idea about how these features relate to the outcome variable:
library(corrplot)
# cor() only accepts numeric input, so keep the numeric columns
num_cols <- names(train)[sapply(train, is.numeric)]
cor_data <- cor(train[, num_cols, with = FALSE])
corrplot(cor_data, method = "square")
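To read the outcome-related correlations as numbers rather than off the plot, the OpenStatus column of the correlation matrix can be pulled out and sorted. A minimal sketch on made-up data (the `toy` data frame and its values are hypothetical):

```r
# Toy data with one non-numeric column (hypothetical values)
toy <- data.frame(
  OpenStatus = c(0, 0, 0, 1, 1, 1),
  BodyNoun   = c(5, 8, 6, 20, 25, 22),
  TitleVerb  = c(2, 1, 2, 1, 2, 1),
  Title      = c("a", "b", "c", "d", "e", "f"),
  stringsAsFactors = FALSE
)

# Keep numeric columns only, since cor() cannot handle text
numeric_cols <- names(toy)[sapply(toy, is.numeric)]
cor_mat <- cor(toy[, numeric_cols])

# Correlation of every predictor with the outcome, strongest first
outcome_cor <- cor_mat[setdiff(numeric_cols, "OpenStatus"), "OpenStatus"]
ranked <- outcome_cor[order(abs(outcome_cor), decreasing = TRUE)]

ranked
```

Sorting by absolute value treats strong negative and strong positive correlations alike, since the sign alone says nothing about how informative a feature is.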
Cool, it looks like the parts of speech features we engineered have a slight to moderate correlation to our outcome variable. Unfortunately, the sentiment features we engineered appear to have a negative correlation, so we will not pursue them in our data exploration. Let’s take a closer look at the parts of speech features.
b_words <- ggplot(train, aes(x = MembershipStatus, y = NumberOfWordsInBody, fill = factor(OpenStatus))) +
stat_summary(fun.y="mean", geom="bar", position='dodge') +
labs(x = 'StackOverflow Status', y = 'Mean Number of Words in Body', title = 'Avg(Words) in Body Based on User Status') +
guides(fill=guide_legend(title='Open Status')) +
scale_fill_hue(labels=c('Closed', 'Open')) +
theme_few() +
theme(plot.title = element_text(hjust = 0.5))
t_words <- ggplot(train, aes(x = MembershipStatus, y = NumberOfWordsInTitle, fill = factor(OpenStatus))) +
stat_summary(fun.y="mean", geom="bar", position='dodge') +
labs(x = 'StackOverflow Status', y = 'Mean Number of Words in Title', title = 'Avg(Words) in Title Based on User Status') +
guides(fill=guide_legend(title='Open Status')) +
scale_fill_hue(labels=c('Closed', 'Open')) +
theme_few() +
theme(plot.title = element_text(hjust = 0.5))
# grid.arrange() comes from the gridExtra package
library(gridExtra)
grid.arrange(b_words, t_words, ncol = 2)
b_nouns <- ggplot(train, aes(x = MembershipStatus, y = BodyNoun, fill = factor(OpenStatus))) +
stat_summary(fun.y="mean", geom="bar", position='dodge') +
labs(x = 'StackOverflow Status', y = 'Mean Number of Nouns in Body', title = 'Avg(Nouns) in Body Based on User Status') +
guides(fill=guide_legend(title='Open Status')) +
scale_fill_hue(labels=c('Closed', 'Open')) +
theme_few() +
theme(plot.title = element_text(hjust = 0.5))
t_nouns <- ggplot(train, aes(x = MembershipStatus, y = TitleNoun, fill = factor(OpenStatus))) +
stat_summary(fun.y="mean", geom="bar", position='dodge') +
labs(x = 'StackOverflow Status', y = 'Mean Number of Nouns in Title', title = 'Avg(Nouns) in Title Based on User Status') +
guides(fill=guide_legend(title='Open Status')) +
scale_fill_hue(labels=c('Closed', 'Open')) +
theme_few() +
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(b_nouns, t_nouns, ncol = 2)
Before building our model, we should remove some of the features that are collinear, as they can affect the performance of our model. Let’s see which features are collinear by using an absolute correlation cutoff of 0.75 and remove them:
# findCorrelation() comes from the caret package
library(caret)
result = findCorrelation(cor_data, cutoff=0.75)
cor_data[, result]
library(knitr)
library(data.table)
corr_tbl <- data.table(read.csv('corr.csv', sep = '\t', stringsAsFactors = T))
kable(corr_tbl)
X | NumberOfWordsInBody | BodyNoun | BodyAdjective | ReputationAtPostCreation |
---|---|---|---|---|
ReputationAtPostCreation | 0.0145523 | -0.0045167 | 0.0032893 | 1.0000000 |
TimeBetweenJoiningAndPosting | 0.0186692 | -0.0049861 | -0.0040282 | 0.2998619 |
NumberOfWordsInTitle | 0.0422945 | 0.0191837 | 0.0191390 | 0.0270069 |
NumberOfQuestionsInTitle | -0.0587115 | -0.0728418 | -0.0498797 | 0.0739489 |
NumberOfQuestionsInBody | 0.2312444 | 0.2017594 | 0.2241463 | 0.0220877 |
TagsContainHomework | 0.0356223 | 0.0292592 | 0.0273377 | -0.0084125 |
PostBodyContainsCodeFragment | 0.0441115 | 0.0499384 | 0.0543177 | -0.0039682 |
TitleNoun | 0.0649369 | 0.0543866 | 0.0281176 | -0.0056978 |
TitleAdverb | 0.0274721 | 0.0242831 | 0.0257428 | 0.0162788 |
TitleVerb | 0.0324711 | 0.0199504 | 0.0193019 | 0.0273074 |
TitleAdjective | 0.0309499 | 0.0036676 | 0.0365767 | 0.0302319 |
BodyNoun | 0.8950727 | 1.0000000 | 0.7921082 | -0.0045167 |
BodyAdverb | 0.7330888 | 0.8121151 | 0.7610987 | -0.0009667 |
BodyVerb | 0.8079848 | 0.6226818 | 0.5231279 | 0.0343449 |
BodyAdjective | 0.7487115 | 0.7921082 | 1.0000000 | 0.0032893 |
HomeworkInTitle | 0.0120184 | 0.0043175 | 0.0056380 | -0.0017060 |
HomeworkInBody | 0.0657162 | 0.0429439 | 0.0387880 | -0.0077251 |
SentimentTitle | -0.0623379 | -0.0878986 | -0.0626156 | 0.0108036 |
SentimentBody | 0.0045638 | 0.0077567 | 0.0026799 | -0.0007069 |
Year | 0.0442817 | 0.0597827 | 0.0441286 | -0.0358436 |
NumberOfTags | 0.1322587 | 0.0849063 | 0.0651901 | 0.0285871 |
DayOfWeek | 0.0038018 | 0.0012470 | 0.0070569 | -0.0034225 |
MonthNumber | -0.0033319 | -0.0028840 | 0.0007746 | -0.0037980 |
MembershipStatus | 0.0001896 | -0.0210451 | -0.0112578 | 0.2736178 |
PopularityNumber | -0.0252074 | -0.0291668 | -0.0291960 | 0.0031493 |
OwnerUndeletedAnswerCountAtPostTime | 0.0188627 | -0.0029363 | 0.0052784 | 0.9117019 |
NumberOfWordsInBody | 1.0000000 | 0.8950727 | 0.7487115 | 0.0145523 |
OpenStatus | 0.0988113 | 0.0800382 | 0.0609730 | 0.0615442 |
We can see that OwnerUndeletedAnswerCountAtPostTime is highly correlated with ReputationAtPostCreation, and that NumberOfWordsInBody is highly correlated with both BodyNoun and BodyVerb. Let’s remove the OwnerUndeletedAnswerCountAtPostTime and NumberOfWordsInBody features.
train$OwnerUndeletedAnswerCountAtPostTime <- NULL
train$NumberOfWordsInBody <- NULL
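To make the idea behind a correlation cutoff concrete, here is a simplified greedy filter written in base R. It is only a sketch and not caret's implementation: the helper name `flag_correlated` is hypothetical, and the toy matrix reuses four rounded values from the correlation table above.

```r
# Hypothetical helper: repeatedly take the most correlated pair above the
# cutoff and flag the member with the larger mean absolute correlation.
flag_correlated <- function(cor_mat, cutoff = 0.75) {
  abs_cor <- abs(cor_mat)
  diag(abs_cor) <- 0
  flagged <- character(0)
  repeat {
    keep <- setdiff(colnames(abs_cor), flagged)
    sub <- abs_cor[keep, keep, drop = FALSE]
    if (length(keep) < 2 || max(sub) <= cutoff) break
    idx <- which(sub == max(sub), arr.ind = TRUE)[1, ]
    pair <- keep[idx]
    mean_cor <- rowMeans(sub[pair, , drop = FALSE])
    flagged <- c(flagged, pair[which.max(mean_cor)])
  }
  flagged
}

vars <- c("ReputationAtPostCreation", "OwnerUndeletedAnswerCountAtPostTime",
          "NumberOfWordsInBody", "BodyNoun")
toy_cor <- matrix(c(
   1.0000,  0.9117,  0.0146, -0.0045,
   0.9117,  1.0000,  0.0189, -0.0029,
   0.0146,  0.0189,  1.0000,  0.8951,
  -0.0045, -0.0029,  0.8951,  1.0000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

to_drop <- flag_correlated(toy_cor, cutoff = 0.75)
to_drop
```

On this four-variable subset the sketch flags OwnerUndeletedAnswerCountAtPostTime and NumberOfWordsInBody, the same two features removed above. On the full matrix the result could differ, since each variable's mean correlation depends on every other column.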
Now that we have engineered several features, we can create a model to classify posts based on them. First, we want to set aside some of our data as a final test set. We used a 70/30 split to partition our dataset. This allows us to train the model on 70% of the data, then see how well it works on the remaining 30%.
t_idx = sample(seq_len(nrow(traindata)),size = floor(0.7 * nrow(traindata)))
# 70% to train
train = traindata[t_idx,]
# 30% to test
test = traindata[-t_idx,]
nrow(train)
nrow(test)
# Training size
124,846
# Test size
53,506
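Because sample() draws at random, the exact rows in each partition change from run to run unless a seed is set. The sketch below repeats the split on a stand-in data frame with the same number of rows; the seed value and the `toy` object are arbitrary choices made for this illustration.

```r
set.seed(2017)  # arbitrary seed, chosen only for this sketch

n <- 178352                               # number of observations in the dataset
toy <- data.frame(row_id = seq_len(n))    # stand-in for the real feature table

# Draw 70% of the row indices without replacement
train_idx <- sample(seq_len(n), size = floor(0.7 * n))

toy_train <- toy[train_idx, , drop = FALSE]
toy_test  <- toy[-train_idx, , drop = FALSE]

c(train = nrow(toy_train), test = nrow(toy_test))
```

The partition sizes depend only on the row count, so they match the 124,846 and 53,506 reported above regardless of the seed.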
Now, let’s build our model. We’ll do this by using the randomForest package in R. First, let’s do a vanilla run to benchmark our model. We will build the model using an ntree value of 500, which specifies the number of trees to use and set the importance variable to true so that we can get the resulting feature importance statistics.
library(randomForest)
rf <- randomForest(formula = as.factor(OpenStatus) ~ ., data = train, importance = TRUE, ntree = 500)
rf
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 5
OOB estimate of error rate: 29.48%
Confusion matrix:
0 1 class.error
0 43873 18709 0.2989518
1 18095 44169 0.2906174
Not bad. We received an out-of-bag (OOB) error of 29.48%, which, put simply, is an estimate of the prediction error of our model, computed for each observation using only the trees that did not see it during training. We can also see from the confusion matrix how well we classified the observations in each of the two levels of our outcome variable. It looks like we are about equally good at predicting open and closed posts. Let’s look at this graphically:
plot(rf, ylim=c(0, 0.40))
From this graph, we see exactly what was shown in the confusion matrix. The black line is our OOB error (29.48%), the red line is our error rate for closed questions (29.90%), and the green line is our error rate for open questions (29.06%).
We can also see how the number of trees used to construct the classifier impacts the resulting error rate. As the number of trees used approaches 500, we can see the error rate beginning to become parallel with the x-axis, which tells us: (1) 500 was a pretty good value for the number of trees and (2) increasing the number of trees (past 500) will not improve the error rate.
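The three error rates quoted above follow directly from the OOB confusion matrix. As a quick check, the sketch below re-enters the four counts from the benchmark output by hand (rows are the actual classes, columns the predicted ones) and recomputes them.

```r
# OOB confusion matrix from the benchmark run (rows = actual, columns = predicted)
conf <- matrix(c(43873, 18709,
                 18095, 44169),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Per-class error: share of each actual class that was misclassified
class_error <- 1 - diag(conf) / rowSums(conf)

# Overall OOB error: share of all observations that were misclassified
oob_error <- 1 - sum(diag(conf)) / sum(conf)

round(c(closed = class_error[[1]], open = class_error[[2]], OOB = oob_error), 4)
```

The overall error sits between the two class errors because it is their average weighted by class size, and the two classes are almost the same size here.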
Let’s now look at the importance of each feature:
varImpPlot(rf)
From these graphs, we can see that the top features are:
We also made several efforts to improve the accuracy of our model. Our first method involved tweaking the input parameters to the randomForest algorithm. The parameters used included mtry (the number of variables tried at each split), nodesize (the minimum size of terminal nodes), and ntree (the number of trees).
We found that an mtry value of 5, a nodesize of 12, and an ntree value of 500 provided marginally better results than our benchmark.
rf_best <- randomForest(formula = as.factor(OpenStatus) ~ ., data = train, importance = T, ntree = 500, nodesize = 12)
rf_best
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 5
OOB estimate of error rate: 29.35%
Confusion matrix:
0 1 class.error
0 43900 18682 0.2985203
1 17966 44298 0.2885455
From this output, we can see a slight improvement in our OOB error (29.35%), little to no improvement in our error rate for closed questions (29.85%), and a slight improvement in our error rate for open questions (28.85%). Let’s see if the variable importance statistics changed at all:
varImpPlot(rf_best)
It does not look like much has changed. The Year feature seems to have climbed a few ranks in the MeanDecreaseGini graph, but other than that, not much has changed after tuning the parameters.
Our second attempt at improving our model involved using a smaller number of features. According to Han, Guo, and Yu [6], less important features can hurt model performance when using random forest methods. Thus, they suggest the following algorithm:
In Step 1:
In Step 2:
If the error rate continues to decrease during each iteration, then we follow the method again, and keep doing it until the error rate increases, or there are no more features left [6]. Let’s try the algorithm using the same parameters used in our previous model and see what happens:
rf_han_guo_yu = randomForest(as.factor(OpenStatus) ~ BodyNoun + BodyAdverb + SentimentBody
+ BodyVerb + ReputationAtPostCreation + Year
+ TimeBetweenJoiningAndPosting + SentimentTitle + BodyAdjective
+ MonthNumber + NumberOfWordsInTitle + TitleNoun + DayOfWeek,
data = train,
importance = T,
ntree = 500,
mtry = 5,
nodesize = 12)
rf_han_guo_yu
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 5
OOB estimate of error rate: 30.5%
Confusion matrix:
0 1 class.error
0 43236 19346 0.3091304
1 18738 43526 0.3009444
Oh, no! Our error rate increased. It went from 29.35% when considering all features to 30.50% when considering only about half of them. We’ll stop here, as removing any more features will likely continue to increase our error rate.
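The stopping rule we followed (keep shrinking the feature set while the error decreases, stop as soon as it rises) can be written as a short loop. The sketch below is not the published algorithm from [6]; the helper `halve_features`, the halving step, and the toy error function are all assumptions made for illustration.

```r
# Hypothetical helper: keep the top half of a ranked feature list
# for as long as the error returned by `error_fn` keeps decreasing.
halve_features <- function(ranked_features, error_fn) {
  best_features <- ranked_features
  best_error <- error_fn(best_features)
  while (length(best_features) > 1) {
    candidate <- head(best_features, ceiling(length(best_features) / 2))
    candidate_error <- error_fn(candidate)
    if (candidate_error >= best_error) break  # error stopped improving
    best_features <- candidate
    best_error <- candidate_error
  }
  list(features = best_features, error = best_error)
}

# Toy error function (made up): pretends the error is lowest with two features
toy_error <- function(features) 0.29 + 0.01 * abs(length(features) - 2)

ranked <- paste0("f", 1:8)   # placeholder names, most important first
result <- halve_features(ranked, toy_error)
result
```

In practice, error_fn could fit a random forest on the given columns and return its final OOB error, for example the last entry of the "OOB" column of the fitted object's err.rate matrix. In our case the loop would have stopped after the first iteration, since the error rose from 29.35% to 30.50%.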
Earlier, we split our data 70/30 into training and test sets and used the 70% portion to train our model. Now let’s make some predictions using the remaining 30% of test data.
# predict() looks up predictors by name, so the full test set can be passed as-is
prediction <- predict(rf_best, newdata = test)
prediction_table <- table(prediction, test$OpenStatus)
prediction_table
prediction_table 0 1
0 18552 7804
1 8042 19108
One question we may want to ask about this confusion matrix is how precise is our classifier? In other words, of the observations we predicted to be open, what proportion actually were open? We can calculate this using a simple formula: \[P = \frac{TP}{TP + FP}\]
precision = prediction_table[2,2]/(sum(prediction_table[2,1:2]))
precision
0.7037937
How about recall? Recall is the proportion of all truly positive observations that we correctly identified as positive. We can calculate this using the following formula: \[R = \frac{TP}{TP + FN}\]
recall = prediction_table[2,2]/(sum(prediction_table[1:2,2]))
recall
0.7100178
Let’s also look at the F-Score, which considers both precision and recall and can be calculated as follows: \[F = 2\,\frac{P \cdot R}{P + R}\]
f1Score = (2*precision*recall)/(precision+recall)
f1Score
0.7068921
We can also look at the overall model accuracy by considering the number of correct predictions over all predictions: \[A = \frac{TP+TN}{TP+TN+FP+FN}\]
accuracy = (prediction_table[2,2] + prediction_table[1,1])/(sum(prediction_table[1:2,1:2]))
accuracy
0.7038463
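All four metrics depend on the same four cells of the prediction table, so they can be bundled into one function. In the sketch below the counts are re-entered by hand from the table above (rows are predictions, columns are actual values); the helper name `classification_metrics` is hypothetical, and class "1" (open) is treated as the positive class.

```r
# Prediction table from above (rows = prediction, columns = actual)
pred_table <- matrix(c(18552,  7804,
                        8042, 19108),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(prediction = c("0", "1"),
                                     actual = c("0", "1")))

# Hypothetical helper: precision, recall, F1 and accuracy for a 2x2 table
classification_metrics <- function(tab, positive = "1") {
  negative <- setdiff(rownames(tab), positive)
  tp <- tab[positive, positive]   # predicted open, actually open
  fp <- tab[positive, negative]   # predicted open, actually closed
  fn <- tab[negative, positive]   # predicted closed, actually open
  tn <- tab[negative, negative]   # predicted closed, actually closed
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  c(precision = precision,
    recall    = recall,
    f1        = 2 * precision * recall / (precision + recall),
    accuracy  = (tp + tn) / sum(tab))
}

metrics <- classification_metrics(pred_table)
round(metrics, 4)
```

Indexing the table by class label rather than by position makes it harder to mix up rows and columns, which is an easy mistake when the table is built with predictions as rows.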
RQ1: What are the main features that indicate whether a new post will remain open or become closed? As was seen in the model results, the major factors playing a role in the model are the composition and sentiment of the body. These factors were significant in determining whether a post will remain open or closed. Unfortunately, some of the variables we had hoped to be strong features, such as the presence of homework in the title or body, membership status, and whether the post had code fragments were not very important.
RQ2: Why do these features play the roles they play? When considering the post composition, it makes sense that the number of nouns, verbs, and so on are key factors. A post that is too long or too short may fail to convey a meaningful question. Further, the sentiment of the post can also play a key role. If the post has a negative sentiment, it may convey an unintended message and end up being closed. However, even a positive post can be closed if it doesn’t contain enough content.
As for the features that didn’t play major roles, we believe many of them were weak because so few posts exhibited them. Our dataset only contained approximately 1,000 posts with code fragments. This was less than one percent of the data overall. Furthermore, some students do not seem to use common keywords when posting homework questions. Our dataset had a little less than 3,000 posts with homework and related terms in the body, and less than 300 in the title.
Refine our features to detect code fragments without the code tags. From this we can not only determine the impact of the different types of ways of posting code, but we can also determine if having code in a post plays a key role.
Determine how effective the parts of speech breakdown are versus using the number of words in the body and title. As we saw, most of the key features were related to post body composition. Returning these to a singular value may reveal just how useful the breakdown was in determining the post status.
Finally, our analysis of both sentiment and parts of speech included code fragments. Excluding code from these features could further refine how much of an impact they have.
In this study, we examined several features of Stack Overflow questions and created a model that can predict the open status of a question with 70% accuracy. We have also identified several key features of a Stack Overflow post that play a role in its open status. Our results are comparable to previous work [7] in this domain, but there is still room for improvement. We used the random forest algorithm to build our model and make predictions. However, it could be beneficial to examine other models, such as support vector machines (SVM) or gradient-boosted trees (GBT), to see if they would offer any additional improvement in classification.
[1] Stack Overflow. (n.d.). Retrieved May 2, 2017, from https://stackoverflow.com/
[2] Reddit. (n.d.). Retrieved May 2, 2017, from https://www.reddit.com
[3] Biostars. (n.d.). Retrieved May 2, 2017, from https://www.biostars.org
[4] Quora. (n.d.). Retrieved May 2, 2017, from https://www.quora.com
[5] Kaggle. (n.d.). Retrieved May 2, 2017, from https://www.kaggle.com/c/predict-closed-questions-on-stack-overflow
[6] Han, H., Guo, X., & Yu, H. (2016, August). Variable selection using Mean Decrease Accuracy and Mean Decrease Gini based on Random Forest. In Software Engineering and Service Science (ICSESS), 2016 7th IEEE International Conference on (pp. 219-224). IEEE.
[7] Correa, D., & Sureka, A. (2013, October). Fit or unfit: analysis and prediction of ‘closed questions’ on stack overflow. In Proceedings of the first ACM conference on Online social networks (pp. 201-212). ACM.
[8] Correa, D., & Sureka, A. (2014, April). Chaff from the wheat: characterization and modeling of deleted questions on stack overflow. In Proceedings of the 23rd international conference on World wide web (pp. 631-642). ACM.
[9] Xia, X., Lo, D., Correa, D., Sureka, A., & Shihab, E. (2016, June). It takes two to tango: Deleted stack overflow question prediction with text and meta features. In Computer Software and Applications Conference (COMPSAC), 2016 IEEE 40th Annual (Vol. 1, pp. 73-82). IEEE.
[10] Ponzanelli, L., Mocci, A., Bacchelli, A., Lanza, M., & Fullerton, D. (2014, September). Improving low quality stack overflow post detection. In Software Maintenance and Evolution (ICSME), 2014 IEEE International Conference on (pp. 541-544). IEEE.
[11] Ponzanelli, L., Mocci, A., Bacchelli, A., & Lanza, M. (2014, October). Understanding and classifying the quality of technical forum questions. In Quality Software (QSIC), 2014 14th International Conference on (pp. 343-352). IEEE.