Text as Data: Blog#3

Preprocessing and Document Features

Kalimah Muhammad
2022-04-14

Background and Research Questions

This project explores whether, and how, online sentiment changed within the Christian, Jewish, and Muslim faith subreddit communities during the pandemic. The goal is to conduct a comparative sentiment analysis of three major faith subreddits, r/Christianity, r/Judaism, and r/Islam, during the first two years of the pandemic. Initial questions to guide the research include:
• How did the content and sentiment within these subreddit groups change over time?
• How do emotionality and sentiment compare between the subreddit faith groups?
• How did sentiment within the faith communities trend in comparison to that of r/Covid19?

Preprocessing

Options for Preprocessing
There are several options for preprocessing. Below are the choices I made and why.

• Capitalization – removed; upper and lowercase words carry no substantive difference in meaning in this data set. Most proper nouns are distinguishable even when lowercase.
• Punctuation – removed; punctuation adds little meaning for my purposes.
• Stopwords – removed; they did not add substantive meaning for the research questions.
• Numbers – removed; a scan showed most numbers were years, ages, and/or percentages, which added little meaning or context for my research questions.
• Stemming – skipped; at this stage I did not see a need for this step. Insights were gathered from lemmas in later phases.
• Infrequent Terms – removed; words below a 1% frequency threshold were removed, as the aim is to find trends in sentiment rather than outliers or a comprehensive listing.
• N-grams – skipped; although such terms may be useful for understanding the context of discussion, those insights were gathered during dependency parsing.

I hoped to use the preText package to review how sensitive the results are to these selections; however, I was unable to install the package.
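
The preprocessing choices listed above can be sketched in a few lines. This is a minimal illustration in plain Python, independent of whatever toolkit produced the outputs in this post. The stopword list, the function names `preprocess` and `trim_infrequent`, the decision to keep hyphens, and the reading of the 1% threshold as a share of all tokens are all assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real list is much longer).
STOPWORDS = {"a", "an", "and", "the", "is", "it", "of", "to", "in", "i"}

def preprocess(text):
    """Lowercase, strip punctuation, and drop numbers and stopwords."""
    text = text.lower()
    # Replace punctuation with spaces, keeping hyphens so "covid-19" survives.
    text = re.sub(r"[^\w\s-]", " ", text)
    tokens = [t.strip("-") for t in text.split()]
    # Drop empty strings, purely numeric tokens, and stopwords.
    return [t for t in tokens if t and not t.isdigit() and t not in STOPWORDS]

def trim_infrequent(docs, min_share=0.01):
    """Keep terms whose share of all tokens is at least min_share (assumed reading of 1%)."""
    counts = Counter(t for doc in docs for t in doc)
    total = sum(counts.values())
    keep = {t for t, c in counts.items() if c / total >= min_share}
    return [[t for t in doc if t in keep] for doc in docs]

docs = [preprocess("It's 2020, and COVID-19 is here!"),
        preprocess("Pray for the people in 2021.")]
print(docs)
```

Note that replacing the apostrophe in "It's" with a space leaves a stray "s" token, which is the same kind of single-character token that shows up in the feature tables further down.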

Readability
I also conducted a readability assessment using the Flesch.Kincaid, FOG, and Coleman.Liau.grade measures. Overall, average readability scores varied slightly across the three measures.

   document         Flesch.Kincaid        FOG          Coleman.Liau.grade
 Length:3481        Min.   :-3.400   Min.   : 0.400   Min.   :-39.508
 Class :character   1st Qu.: 4.791   1st Qu.: 6.454   1st Qu.:  4.980
 Mode  :character   Median : 7.885   Median :10.510   Median :  8.631
                    Mean   : 8.501   Mean   :11.077   Mean   :  8.753
                    3rd Qu.:11.910   3rd Qu.:15.033   3rd Qu.: 12.725
                    Max.   :67.400   Max.   :69.934   Max.   : 89.849
                    NA's   :7        NA's   :7        NA's   :7

The average scores by scale are 8.501 for Flesch.Kincaid (8th grade level), 11.077 for FOG (11th grade level), and 8.753 for Coleman.Liau.grade (8th to 9th grade level). These averages suggest the text is conversational and fairly easy to read. Note that the lowest scores typically came from documents with one or very few words, while the highest scores came from longer documents as well as, surprisingly, documents consisting solely of a hashtag (e.g. #JusticeForGeorgeFloyd and #JewishLivesMatter).
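
The extreme values are easier to understand with the Flesch-Kincaid grade formula in hand: 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below takes raw counts as input so that no syllable-counting heuristic is needed. The function name is hypothetical, and the seven-syllable count for a hashtag treated as a single word is an assumption about how a syllable counter might handle it, not something confirmed by the output above.

```python
def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level computed from raw counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# A document consisting of a single one-syllable word.
print(round(flesch_kincaid_grade(words=1, sentences=1, syllables=1), 3))  # -3.4

# A single long "word" (for example a hashtag) counted as seven syllables (assumed).
print(round(flesch_kincaid_grade(words=1, sentences=1, syllables=7), 3))  # 67.4
```

These two values coincide with the minimum (-3.400) and maximum (67.400) in the Flesch.Kincaid column, which is consistent with the observation that one-word documents sit at the bottom of the scale and lone hashtags at the top.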

The correlations between the readability scores are as follows.

[1] 0.8943008
[1] 0.6835212

Here we see that the correlation between Flesch.Kincaid and FOG is 0.89, while that between Coleman.Liau.grade and FOG is 0.68.
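
For reference, a correlation coefficient like the two above can be computed as follows. This is a minimal sketch of the Pearson correlation in plain Python. The scores are made-up values, not the data from this project, and skipping documents with a missing score (the summary shows 7 NA's per measure) is an assumption about how missing values were handled.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation over the pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / sqrt(var_x * var_y)

# Hypothetical scores for five documents; None marks a missing score.
fk_scores = [3.0, 6.5, 12.0, None, 9.0]
fog_scores = [5.0, 9.0, 14.5, 10.0, 12.5]
print(round(pearson(fk_scores, fog_scores), 3))
```

The result is a coefficient between -1 and 1 rather than a percentage, so 0.89 indicates a strong positive relationship and 0.68 a moderate one.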

Document Features and Frequency

The next few tables and charts show the overall top features across the DFM (all years and subreddits combined). Note that single-character tokens such as "s" and "t" are present as a result of punctuation removal.

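The first table lists the top 50 features by frequency, while the second table adds the number of documents containing each of the top 20 features (docfreq). Comparing frequency with document frequency shows whether a term was used widely across documents or used heavily within fewer documents. For example, covid-19 appears 517 times across 485 documents, whereas god appears 501 times in only 237 documents.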

  covid-19        god       just     people          s       will 
       517        501        432        384        364        310 
      like        can          t         us       life       time 
       303        303        275        275        271        258 
      love        one      jesus        now       know      first 
       258        253        235        228        224        223 
      pray     jewish      years        day     please     really 
       217        201        200        199        190        185 
       new      today       much      thank     church sars-cov-2 
       181        181        180        174        169        169 
       may    vaccine  christian          m     muslim        got 
       165        163        159        155        154        154 
     never       want       even     christ     family        get 
       152        148        146        143        143        143 
      good      bible      islam      going       feel        see 
       142        139        136        133        131        130 
       say       made 
       129        128 
    feature frequency rank docfreq group
1  covid-19       517    1     485   all
2       god       501    2     237   all
3      just       432    3     257   all
4    people       384    4     228   all
5         s       364    5     226   all
6      will       310    6     171   all
7      like       303    7     191   all
8       can       303    7     197   all
9         t       275    9     137   all
10       us       275    9     172   all
11     life       271   11     141   all
12     time       258   12     173   all
13     love       258   12     151   all
14      one       253   14     181   all
15    jesus       235   15     152   all
16      now       228   16     154   all
17     know       224   17     149   all
18    first       223   18     193   all
19     pray       217   19     149   all
20   jewish       201   20     166   all

Additional steps are needed to remove single-character tokens.
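
One possible way to handle that extra step, and to reproduce the frequency and docfreq columns, is sketched below in plain Python. The function name `feature_table`, the `min_chars` parameter, and the toy documents are all hypothetical; the tables above came from a different toolchain.

```python
from collections import Counter

def feature_table(docs, min_chars=2):
    """Return (feature, frequency, docfreq) rows sorted by frequency."""
    # Drop tokens shorter than min_chars, such as a stray "s" or "t".
    docs = [[t for t in doc if len(t) >= min_chars] for doc in docs]
    # frequency: total occurrences across all documents.
    freq = Counter(t for doc in docs for t in doc)
    # docfreq: number of documents containing the term at least once.
    docfreq = Counter(t for doc in docs for t in set(doc))
    return [(t, n, docfreq[t]) for t, n in freq.most_common()]

# Hypothetical tokenized documents.
docs = [["god", "s", "love", "god"], ["covid-19", "t", "god"], ["covid-19", "love"]]
for row in feature_table(docs):
    print(row)
```

A term whose frequency is much higher than its docfreq is concentrated in a few documents, while a term where the two numbers are close is spread thinly across many.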

Visualizing Term Frequencies

The word cloud below represents the top 150 terms/features expressed over the two years of the pandemic, 2020 - 2021.

Lemmas
As mentioned earlier, I skipped stemming in favor of reviewing lemmas. The first output below shows the first six rows of the annotated token table; it is followed by the top 20 lemmatized nouns, verbs, and proper nouns across three years.

  year month day  timestamp subreddit comments doc_id sid tid
1 2019    01  01 1546341461     ISLAM      151      1   1   1
2 2019    01  01 1546341461     ISLAM      151      1   1   2
3 2019    01  01 1546341461     ISLAM      151      1   1   3
4 2019    01  01 1546341461     ISLAM      151      1   1   4
5 2019    01  01 1546341461     ISLAM      151      1   1   5
6 2019    01  01 1546341461     ISLAM      151      1   1   6
          token token_with_ws         lemma  upos xpos
1 Alhamdulillah Alhamdulillah Alhamdulillah  INTJ   UH
2             ,            ,              , PUNCT    ,
3             I            I              I  PRON  PRP
4  successfully successfully   successfully   ADV   RB
5     completed    completed       complete  VERB  VBD
6            10           10             10   NUM   CD
                                       feats tid_source  relation
1                                       <NA>          5 discourse
2                                       <NA>          5     punct
3 Case=Nom|Number=Sing|Person=1|PronType=Prs          5     nsubj
4                                       <NA>          5    advmod
5           Mood=Ind|Tense=Past|VerbForm=Fin          0      root
6                               NumType=Card          7    nummod

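For nouns, 6 of the top 20 terms refer to time (year, day, time, today, week, and month), while terms for groups of people include people, family, church (which could also refer to a place), and man. Love also appears as a noun in this list. The entries "\u0019" and "\u001d" appear to be stray control characters, possibly left over from curly quotation marks, rather than real lemmas.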

# A tibble: 20 x 2
   lemma    count
   <chr>    <int>
 1 "Covid"    420
 2 "people"   382
 3 "year"     319
 4 "day"      317
 5 "time"     315
 6 "life"     301
 7 "today"    193
 8 "prayer"   183
 9 "thing"    172
10 "\u0019"   167
11 "family"   164
12 "church"   145
13 "\u001d"   124
14 "world"    121
15 "man"      118
16 "edit"     116
17 "week"     110
18 "month"    109
19 "post"     109
20 "love"     108

For verbs, most of the top terms are common general-purpose verbs; the exceptions include pray, love, give, think, thank, and start.

# A tibble: 20 x 2
   lemma    count
   <chr>    <int>
 1 "have"     598
 2 "go"       371
 3 "get"      330
 4 "do"       328
 5 "make"     315
 6 "pray"     300
 7 "know"     292
 8 "say"      284
 9 "feel"     281
10 "want"     257
11 "be"       252
12 "see"      248
13 "love"     222
14 "think"    207
15 "give"     196
16 "thank"    174
17 "take"     173
18 "come"     171
19 "\u0019"   170
20 "start"    154

For proper nouns, the top terms primarily refer to texts and language specific to these communities, such as God, Jesus, Muslims, Jews, Israel, Bible, and Quran. We also see a strong emergence of pandemic-related terms (e.g. COVID, Vaccine, SARS).

# A tibble: 20 x 2
   lemma        count
   <chr>        <int>
 1 God            487
 2 Jesus          230
 3 Christ         143
 4 CoV            142
 5 Islam          121
 6 SARS           119
 7 Allah          112
 8 Covid          101
 9 Jews           100
10 Christians      96
11 Lord            86
12 Vaccine         83
13 Christianity    76
14 christian       72
15 Muslims         72
16 COVID           71
17 Bible           70
18 Christian       62
19 Israel          61
20 Quran           57
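
Tables like the three above amount to filtering the annotated tokens by their `upos` tag and counting lemmas. The following is a minimal sketch in plain Python, assuming tokens are available as dictionaries with the `lemma` and `upos` columns shown in the annotation sample. The function name, the `fold_case` option, and the toy tokens are hypothetical. Folding case would merge variants such as Covid and COVID, or christian and Christian, which are counted separately above.

```python
from collections import Counter

def top_lemmas(tokens, upos, n=20, fold_case=False):
    """Count lemmas for one part-of-speech tag and return the n most common."""
    lemmas = (tok["lemma"] for tok in tokens if tok["upos"] == upos)
    # Skip lemmas with no letters or digits (stray control characters, punctuation).
    lemmas = (lem for lem in lemmas if any(ch.isalnum() for ch in lem))
    if fold_case:
        lemmas = (lem.lower() for lem in lemmas)
    return Counter(lemmas).most_common(n)

# Hypothetical annotated tokens, not taken from the project data.
tokens = [
    {"lemma": "Covid", "upos": "PROPN"},
    {"lemma": "COVID", "upos": "PROPN"},
    {"lemma": "God", "upos": "PROPN"},
    {"lemma": "\u0019", "upos": "NOUN"},
    {"lemma": "prayer", "upos": "NOUN"},
    {"lemma": "pray", "upos": "VERB"},
]
print(top_lemmas(tokens, "PROPN", fold_case=True))
print(top_lemmas(tokens, "NOUN"))
```

Whether to fold case is a judgment call: it gives cleaner counts for terms like COVID, but it also erases the distinction between a proper noun and the same word used generically.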

Dependency Parsing
This stage was conducted by year to see whether there were any similarities or differences across years. Ultimately, I was unable to sort the list by frequency, which would have been more valuable than the chronological order listed here.

The 2019 results:

 [1] "Quran => arabic"          "prayers => all"          
 [3] "promise => make"          "prayers => all"          
 [5] "prayers => my"            "prayers => missed"       
 [7] "day => one"               "day => program"          
 [9] "fardh => one"             "fardh => prayer"         
[11] "what => did"              "promise => my"           
[13] "promise => completion"    "request => isn"          
[15] "simce => much"            "simce => knowledge"      
[17] "distractions => all"      "distractions => needless"
[19] "distractions => shirk"    "courses => online"       

The 2020 results:

 [1] "people => these"         "people => two"          
 [3] "people => wonderful"     "males => all"           
 [5] "abortion => marriage"    "abortion => communism"  
 [7] "infidels => such"        "infidels => opportunity"
 [9] "heresies => their"       "moms => my"             
[11] "carpet => his"           "carpet => own"          
[13] "carpet => praying"       "carpet => mosque"       
[15] "group => support"        "God => who"             
[17] "you => all"              "top => football"        
[19] "top => name"             "Israeli => an"          

The 2021 results:

 [1] "drugs => associated"   "painting => this"     
 [3] "Country => my"         "blessings => my"      
 [5] "spray => nasal"        "risk => death"        
 [7] "ayahs => these"        "ayahs => beautiful"   
 [9] "praise => more"        "credit => all"        
[11] "\030very => another"   "person => new"        
[13] "lunch => celebratory"  "treatments => cut"    
[15] "time => hospital"      "lives => children"    
[17] "Anti-Abortion => both" "Anti-Abortion => see" 
[19] "end => abortion"       "end => making"
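
Since the pairs above are listed in order of appearance, one way to rank them by frequency instead is to count the pair strings directly. The sketch below uses plain Python and the first six pairs from the 2019 output. Treating the left-hand side of "=>" as the head word is an assumption about the output format.

```python
from collections import Counter

# The first six pairs from the 2019 output above.
pairs = [
    "Quran => arabic", "prayers => all",
    "promise => make", "prayers => all",
    "prayers => my", "prayers => missed",
]

# Count identical pairs, then count how often each left-hand word occurs.
pair_counts = Counter(pairs)
head_counts = Counter(p.split(" => ")[0] for p in pairs)

print(pair_counts.most_common(3))
print(head_counts.most_common(2))
```

Counting the left-hand words separately also shows which terms attract the most modifiers, which may be more informative than exact pair matches when most pairs occur only once.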