Preprocessing and Document Features
This project explores whether, and how, online sentiment changed within the Christian, Jewish, and Muslim faith communities on Reddit during the pandemic. The goal is a comparative sentiment analysis of three major faith subreddits, r/Christianity, r/Judaism, and r/Islam, over the first two years of the pandemic. Initial questions guiding the research include:
• How did the content and sentiment within these subreddit groups change over time?
• How do emotionality and sentiment compare between the subreddit faith groups?
• How did sentiment within the faith communities trend in comparison to that of r/Covid19?
Options for Preprocessing
There are several options for preprocessing. Below are the choices I made and why.
• Capitalization – removed; upper or lowercase makes no substantive difference in meaning in this data set, and most proper nouns remain recognizable in lowercase.
• Punctuation – removed; punctuation added little meaning for my purposes.
• Stopwords – removed; they did not add substantive meaning for the research question.
• Numbers – removed; a scan showed most numbers were years, ages, and/or percentages, which added little meaning or context for my research question.
• Stemming – skipped; at this stage I did not see a need for this step. Insights were gathered from lemmas in later phases.
• Infrequent Terms – removed; terms with a frequency below 1% were removed, as the aim is to find trends in sentiment rather than outliers or a comprehensive listing.
• N-grams – skipped; multi-word terms may be useful for understanding the context of discussion, but those insights were gathered during dependency parsing instead.
I had hoped to use the preText package to review how sensitive the results are to these selections; however, I was unable to install the package.
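The choices listed above can be sketched in a few lines of code. The following is a minimal Python illustration, not the code used for this report: the stopword list is a tiny placeholder, the function names `preprocess` and `trim_rare` are hypothetical, and the 1% threshold is assumed here to mean a term's share of all tokens.

```python
import re
from collections import Counter

# Tiny placeholder stopword list (an assumption; the report's actual list is not shown).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "i"}

def preprocess(text):
    """Lowercase, strip punctuation and numbers, then drop stopwords."""
    text = text.lower()
    # Replace everything except a-z and whitespace with a space.
    # Caveat: this also removes non-Latin script (e.g. Arabic or Hebrew words).
    text = re.sub(r"[^a-z\s]", " ", text)
    return [tok for tok in text.split() if tok not in STOPWORDS]

def trim_rare(docs_tokens, min_share=0.01):
    """Keep only terms whose share of all tokens is at least min_share."""
    counts = Counter(tok for doc in docs_tokens for tok in doc)
    total = sum(counts.values())
    keep = {term for term, n in counts.items() if n / total >= min_share}
    return [[tok for tok in doc if tok in keep] for doc in docs_tokens]

# Made-up example documents, for illustration only.
docs = ["I prayed for 10 days in 2020!", "Prayed again; the church is open."]
tokens = [preprocess(d) for d in docs]
print(tokens)
print(trim_rare(tokens, min_share=0.2))
```

Because each step is a separate function, it is straightforward to switch one choice on or off and compare the resulting features, which is the kind of sensitivity check described above.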
Readability
I also assessed readability using the Flesch.Kincaid, FOG, and Coleman.Liau.grade measures. Overall, the average scores varied only modestly across the three measures.
document: Length 3481, Class character, Mode character

           Flesch.Kincaid       FOG   Coleman.Liau.grade
Min.               -3.400     0.400              -39.508
1st Qu.             4.791     6.454                4.980
Median              7.885    10.510                8.631
Mean                8.501    11.077                8.753
3rd Qu.            11.910    15.033               12.725
Max.               67.400    69.934               89.849
NA's                    7         7                    7
The mean scores by measure are 8.501 for Flesch.Kincaid (roughly an 8th-grade level), 11.077 for FOG (11th grade), and 8.753 for Coleman.Liau.grade (8th to 9th grade). These averages suggest the text is conversational and fairly easy to read. Note that the lowest scores typically came from documents with one or very few words, while the highest scores came from longer documents and, surprisingly, from documents consisting solely of a hashtag (e.g., #JusticeForGeorgeFloyd and #JewishLivesMatter).
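The extreme scores for hashtag-only documents follow from how these measures are built. As an illustration, the sketch below applies the standard Coleman-Liau formula (0.0588 L - 0.296 S - 15.8, where L is letters per 100 words and S is sentences per 100 words) in Python. The tokenization and sentence counting here are simplifying assumptions and may differ from the tool used for this report, so the numbers will not match the summary exactly.

```python
import re

def coleman_liau_grade(text):
    """Coleman-Liau grade level; returns None for texts with no words."""
    words = text.split()
    n_words = len(words)
    if n_words == 0:
        return None
    n_letters = sum(ch.isalpha() for ch in text)
    # Count runs of sentence-ending punctuation; a text with none counts as one sentence.
    n_sentences = max(1, len(re.findall(r"[.!?]+", text)))
    letters_per_100 = 100 * n_letters / n_words
    sentences_per_100 = 100 * n_sentences / n_words
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8

# A single hashtag is one very long "word", so letters per 100 words is huge.
print(round(coleman_liau_grade("#JusticeForGeorgeFloyd"), 2))  # 78.08
# A very short sentence of short words scores below zero.
print(coleman_liau_grade("God is good.") < 0)  # True
```

A hashtag is scored as a single 21-letter word, which pushes the grade far above any realistic reading level, while a document of a few short words lands below zero.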
The correlations between the readability scores are as follows.
[1] 0.8943008
[1] 0.6835212
Here we see that the correlation between Flesch.Kincaid and FOG is about 0.89, while that between Coleman.Liau.grade and FOG is about 0.68.
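For reference, a correlation of this kind can be computed as a Pearson coefficient over the documents that have both scores. The Python sketch below is illustrative only: the function name `pearson` is hypothetical, the input values are made up, and dropping documents with a missing score is an assumption about how the NA's were handled.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation over pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / sqrt(var_x * var_y)

# Made-up scores for five documents; None marks a missing score.
flesch_kincaid = [4.8, 7.9, 11.9, None, 8.5]
fog = [6.5, 10.5, 15.0, 9.0, 12.2]
print(round(pearson(flesch_kincaid, fog), 3))
```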
The next few tables and charts show the overall top features across the DFM, covering all years and subreddits. Note that single-character tokens such as s, t, and m are present; they result from punctuation removal, most likely from contractions and possessives.
The first table lists the top 50 features, while the second shows the number of documents containing each of the top 20 features (docfreq). Comparing the two distinguishes terms that are used widely across documents from terms that are used heavily in only a few.
covid-19 god just people s will
517 501 432 384 364 310
like can t us life time
303 303 275 275 271 258
love one jesus now know first
258 253 235 228 224 223
pray jewish years day please really
217 201 200 199 190 185
new today much thank church sars-cov-2
181 181 180 174 169 169
may vaccine christian m muslim got
165 163 159 155 154 154
never want even christ family get
152 148 146 143 143 143
good bible islam going feel see
142 139 136 133 131 130
say made
129 128
feature frequency rank docfreq group
1 covid-19 517 1 485 all
2 god 501 2 237 all
3 just 432 3 257 all
4 people 384 4 228 all
5 s 364 5 226 all
6 will 310 6 171 all
7 like 303 7 191 all
8 can 303 7 197 all
9 t 275 9 137 all
10 us 275 9 172 all
11 life 271 11 141 all
12 time 258 12 173 all
13 love 258 12 151 all
14 one 253 14 181 all
15 jesus 235 15 152 all
16 now 228 16 154 all
17 know 224 17 149 all
18 first 223 18 193 all
19 pray 217 19 149 all
20 jewish 201 20 166 all
Additional steps are needed to remove single-character tokens.
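As a rough illustration of both points, the Python sketch below computes term frequency and document frequency side by side while dropping single-character tokens. It is not the code behind the tables above; the function name `feature_stats` and the example documents are hypothetical.

```python
from collections import Counter

def feature_stats(docs_tokens, min_len=2):
    """Return (term, frequency, docfreq) tuples, most frequent first."""
    freq = Counter()
    docfreq = Counter()
    for tokens in docs_tokens:
        # Drop leftovers such as "s", "t", and "m".
        kept = [tok for tok in tokens if len(tok) >= min_len]
        freq.update(kept)          # every occurrence counts
        docfreq.update(set(kept))  # each document counts once per term
    return [(term, n, docfreq[term]) for term, n in freq.most_common()]

# Made-up tokenized documents, for illustration only.
docs = [["god", "s", "love", "god"], ["love", "t", "god"], ["vaccine"]]
for term, n, df in feature_stats(docs):
    print(term, n, df)
```

A term with high frequency but low docfreq is concentrated in a few documents, while similar values for both indicate a term spread across many documents.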
Visualizing Term Frequencies
The word cloud below represents the top 150 terms/features expressed over the two years of the pandemic, 2020 - 2021.
Lemmas
As mentioned earlier, I skipped stemming in favor of reviewing lemmas. Below is a selection of the top 20 lemmatized nouns, verbs, and proper nouns across three years.
year month day timestamp subreddit comments doc_id sid tid
1 2019 01 01 1546341461 ISLAM 151 1 1 1
2 2019 01 01 1546341461 ISLAM 151 1 1 2
3 2019 01 01 1546341461 ISLAM 151 1 1 3
4 2019 01 01 1546341461 ISLAM 151 1 1 4
5 2019 01 01 1546341461 ISLAM 151 1 1 5
6 2019 01 01 1546341461 ISLAM 151 1 1 6
token token_with_ws lemma upos xpos
1 Alhamdulillah Alhamdulillah Alhamdulillah INTJ UH
2 , , , PUNCT ,
3 I I I PRON PRP
4 successfully successfully successfully ADV RB
5 completed completed complete VERB VBD
6 10 10 10 NUM CD
feats tid_source relation
1 <NA> 5 discourse
2 <NA> 5 punct
3 Case=Nom|Number=Sing|Person=1|PronType=Prs 5 nsubj
4 <NA> 5 advmod
5 Mood=Ind|Tense=Past|VerbForm=Fin 0 root
6 NumType=Card 7 nummod
For nouns, 6 of the top 20 terms refer to time (e.g., year, day, today), while terms for groups of people include people, family, church (which could also refer to a place), and man. Love also appears as a noun in this list. The entries "\u0019" and "\u001d" are likely mis-encoded quotation marks rather than real nouns.
# A tibble: 20 x 2
lemma count
<chr> <int>
1 "Covid" 420
2 "people" 382
3 "year" 319
4 "day" 317
5 "time" 315
6 "life" 301
7 "today" 193
8 "prayer" 183
9 "thing" 172
10 "\u0019" 167
11 "family" 164
12 "church" 145
13 "\u001d" 124
14 "world" 121
15 "man" 118
16 "edit" 116
17 "week" 110
18 "month" 109
19 "post" 109
20 "love" 108
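For verbs, most of the top terms are common, general-purpose verbs; the exceptions include pray, love, give, think, thank, and start. The entry "\u0019" is likely a mis-encoded apostrophe rather than a real verb.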
# A tibble: 20 x 2
lemma count
<chr> <int>
1 "have" 598
2 "go" 371
3 "get" 330
4 "do" 328
5 "make" 315
6 "pray" 300
7 "know" 292
8 "say" 284
9 "feel" 281
10 "want" 257
11 "be" 252
12 "see" 248
13 "love" 222
14 "think" 207
15 "give" 196
16 "thank" 174
17 "take" 173
18 "come" 171
19 "\u0019" 170
20 "start" 154
For proper nouns, the top terms primarily reflect language specific to these communities, such as God, Jesus, Muslims, Jews, Israel, Bible, and Quran. We also see a strong emergence of pandemic-related terms (e.g., COVID, Vaccine, SARS). Note that case variants such as Covid and COVID, or Christian and christian, are counted separately.
# A tibble: 20 x 2
lemma count
<chr> <int>
1 God 487
2 Jesus 230
3 Christ 143
4 CoV 142
5 Islam 121
6 SARS 119
7 Allah 112
8 Covid 101
9 Jews 100
10 Christians 96
11 Lord 86
12 Vaccine 83
13 Christianity 76
14 christian 72
15 Muslims 72
16 COVID 71
17 Bible 70
18 Christian 62
19 Israel 61
20 Quran 57
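The lemma tables above amount to filtering the annotated tokens by part of speech and counting lemmas. The Python sketch below shows that idea under stated assumptions: the rows are made-up dictionaries that reuse the lemma and upos column names from the annotation output, and the function name `top_lemmas` and its `fold_case` option are hypothetical. Folding case is one way variants such as Covid and COVID could be merged.

```python
from collections import Counter

def top_lemmas(rows, upos, n=20, fold_case=False):
    """Count lemmas for one part-of-speech tag, most frequent first."""
    counts = Counter()
    for row in rows:
        if row["upos"] != upos:
            continue
        lemma = row["lemma"].lower() if fold_case else row["lemma"]
        counts[lemma] += 1
    return counts.most_common(n)

# Made-up annotation rows, for illustration only.
rows = [
    {"lemma": "Covid", "upos": "PROPN"},
    {"lemma": "COVID", "upos": "PROPN"},
    {"lemma": "God", "upos": "PROPN"},
    {"lemma": "pray", "upos": "VERB"},
    {"lemma": "Covid", "upos": "PROPN"},
]
print(top_lemmas(rows, "PROPN"))
print(top_lemmas(rows, "PROPN", fold_case=True))
```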
Dependency Parsing
This stage was run separately for each year to look for similarities and differences. I was unable to sort the lists by frequency, which would have been more informative, so the pairs are listed here in chronological order instead.
The 2019 results:
[1] "Quran => arabic" "prayers => all"
[3] "promise => make" "prayers => all"
[5] "prayers => my" "prayers => missed"
[7] "day => one" "day => program"
[9] "fardh => one" "fardh => prayer"
[11] "what => did" "promise => my"
[13] "promise => completion" "request => isn"
[15] "simce => much" "simce => knowledge"
[17] "distractions => all" "distractions => needless"
[19] "distractions => shirk" "courses => online"
The 2020 results:
[1] "people => these" "people => two"
[3] "people => wonderful" "males => all"
[5] "abortion => marriage" "abortion => communism"
[7] "infidels => such" "infidels => opportunity"
[9] "heresies => their" "moms => my"
[11] "carpet => his" "carpet => own"
[13] "carpet => praying" "carpet => mosque"
[15] "group => support" "God => who"
[17] "you => all" "top => football"
[19] "top => name" "Israeli => an"
The 2021 results:
[1] "drugs => associated" "painting => this"
[3] "Country => my" "blessings => my"
[5] "spray => nasal" "risk => death"
[7] "ayahs => these" "ayahs => beautiful"
[9] "praise => more" "credit => all"
[11] "\030very => another" "person => new"
[13] "lunch => celebratory" "treatments => cut"
[15] "time => hospital" "lives => children"
[17] "Anti-Abortion => both" "Anti-Abortion => see"
[19] "end => abortion" "end => making"
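One way the pairs could be ranked by frequency is sketched below in Python. It is illustrative only: the rows are made up, they reuse the doc_id, sid, tid, tid_source, and token column names from the annotation output, and the sketch assumes that tid_source points to the head token within the same sentence and that pairs read "head => dependent", as in "Quran => arabic". The function name `dependency_pair_counts` is hypothetical, and no filtering by relation type is applied.

```python
from collections import Counter

def dependency_pair_counts(rows):
    """Count "head => dependent" pairs across annotated token rows."""
    # Look up each token by its position: (document, sentence, token id).
    token_at = {(r["doc_id"], r["sid"], r["tid"]): r["token"] for r in rows}
    counts = Counter()
    for r in rows:
        if r["tid_source"] == 0:  # root tokens have no head
            continue
        head = token_at.get((r["doc_id"], r["sid"], r["tid_source"]))
        if head is not None:
            counts[f"{head} => {r['token']}"] += 1
    return counts

def make_row(doc_id, sid, tid, token, tid_source):
    return {"doc_id": doc_id, "sid": sid, "tid": tid,
            "token": token, "tid_source": tid_source}

# Made-up rows for two short sentences, for illustration only.
rows = [
    make_row(1, 1, 1, "all", 3),
    make_row(1, 1, 2, "my", 3),
    make_row(1, 1, 3, "prayers", 0),
    make_row(1, 2, 1, "all", 2),
    make_row(1, 2, 2, "prayers", 0),
]
# most_common() sorts the pairs from most to least frequent.
for pair, n in dependency_pair_counts(rows).most_common():
    print(n, pair)
```

Sorting the counts this way would surface recurring pairs such as "prayers => all" at the top of each year's list instead of leaving them scattered in order of appearance.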