This Web Appendix is organized as follows:
Sections 1.1 to 1.4 describe the data cleaning process. Several cleaning approaches were used because the feature sets have different requirements. For the feature sets that require the Stanford Grammatical Dependency Parser (F2, F3, F4), only minimal cleaning was applied so that the parser continues to function correctly. In addition, punctuation was reinserted for these feature sets.
Sections 2.1 to 2.5 describe the feature extraction process. For the more complicated feature sets (F3, F4), an overview sheet is provided (2.3.0 for F3 and 2.4.0 for F4).
Sections 3.1 to 3.15 describe the feature experimentation process, including different types of feature sets. Sections 3.1 to 3.5 consider Term Frequency with percentile cut-offs at the 10th, 30th, 50th, 70th and 90th percentiles. This process was repeated for TF-IDF and Term Presence. The classifier used was Naive Bayes.
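As an illustration of the three weighting schemes and the percentile cut-offs named above, the following is a minimal sketch in plain Python. It is not the code used in the appendix. The function names (term_weights, percentile_cutoff), the toy documents, the nearest-rank percentile rule, and the choice to keep terms at or above the cut-off are all assumptions made for this example; the appendix does not state how the cut-off was applied. The Naive Bayes classification step is omitted.

```python
import math
from collections import Counter


def term_weights(docs, scheme="tf"):
    """Return one {term: weight} dict per tokenized document.

    scheme is "tf" (raw counts), "presence" (1 if the term occurs),
    or "tfidf" (count * log(N / document frequency)).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        if scheme == "tf":
            weights = dict(counts)
        elif scheme == "presence":
            weights = {term: 1 for term in counts}
        elif scheme == "tfidf":
            weights = {
                term: count * math.log(n_docs / df[term])
                for term, count in counts.items()
            }
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        weighted.append(weights)
    return weighted


def percentile_cutoff(weighted, pct):
    """Return the terms whose corpus-level total weight is at or above
    the pct-th percentile (nearest-rank rule; an assumption of this sketch).
    """
    totals = {}
    for weights in weighted:
        for term, value in weights.items():
            totals[term] = totals.get(term, 0) + value
    ordered = sorted(totals.values())
    # Nearest-rank percentile: smallest rank covering pct percent of the values.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    threshold = ordered[rank - 1]
    return {term for term, value in totals.items() if value >= threshold}


# Hypothetical toy corpus, already tokenized.
docs = [["good", "movie", "good"], ["bad", "movie"], ["good", "plot"]]

for pct in (10, 30, 50, 70, 90):
    kept = percentile_cutoff(term_weights(docs, "tf"), pct)
    print(pct, sorted(kept))
```

The same loop can be repeated with scheme="tfidf" or scheme="presence" to mirror the repetition described above; the selected vocabulary would then define the features passed to the classifier.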
Sections 4.1 to 7.2 describe the experimentation process with feature sets 2 to 5. In detail, the following feature sets are considered:
The Naive Bayes classifier was used.