Introduction

The corpuses that we chose were from the two sources Scientific American and Time Magazine. We specifically looked at articles that are related to sleep and picked about the same number of articles from both sources. We copied and pasted these articles into an excel spreadsheet with the column labels “Source” and “Content” according to what source it was (Time or Scientific American) and the content of the article.

Data Wrangling

Before we can do any text analysis, we need to first wrangle our data so that it’s in a suitable form for supervised and unsupervised learning. To multiply our sample size, we split each article into paragraphs using the strsplit() function to split along the in the article and unnested the paragraphs from the vector format. Then, to quickly clean up our data, we filtered the empty paragraph strings from our dataset, and assigned a document ID ‘doc_id’ to each paragraph based on the row of the paragraph in the data.

Then, we decided to conduct unsupervised learning to understand our data better. Before we can use hierarchical clustering though, we have to convert our data into a document feature matrix (DFM). This will essentially show each paragraph from the articles as the rows, and the words in the articles as the columns, with integer values counting up how many times each word was used in a particular paragraph. Source (doc_id) is on the leftmost label.

For our dendrogram, we used hierarchical clustering with our DFM data set to determine if there are any words in our data set that are incredibly different from the general sample. However, there were no particular outliers. The pairwise Euclidean distances between the words were quite large, with some up to 100. However, there were not any outliers, so we did not use the dendrogram extensively to remove any words that appeared to be much more different than the other words, or discover a word that may be important in discerning between the two sources.

articles <- read_csv('/Users/clairezhang/Downloads/STAT 0218/Homework/article.csv')
## Rows: 33 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Source, Content
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
articles_clean <- articles |>
  mutate(paragraph = strsplit(Content, "\n")) |>
  unnest(paragraph) |>
  filter(paragraph != '') |>
  mutate(doc_id = row_number())

tokens <- articles_clean |>
  select(-Content) |>
  unnest_tokens(input = "paragraph",
                output = "word",
                token = "words") |>
  filter(!(word %in% stop_words$word))

# Changing the names of the columns so that udpipe recognizes them
tokens1 <- tokens |>
  mutate(doc_id = doc_id,
         text = word)

# Using udpipe to remove the proper nouns and
# any odd strings that are now in our tokens column
tokens2 <- tokens1 |>
  udpipe(ud_model)

# Deselecting the columns are not needed and filtering
# out common names in the data set
tokens3 <- tokens2 |>
  select(-lemma, -xpos, -feats, -misc, -deps, -dep_rel) |>
  filter(upos != "PROPN",
        upos != "X") |>
  mutate(doc_id = as.numeric(doc_id)) |>
  filter(!(sentence %in% c(tolower(babynames$name), "fischman")))

# Joining the tokens with the original data set
# for the final data set
tokenized_content <- left_join(tokens3, 
          articles_clean,
          by = c("doc_id")) |>
  select(doc_id, Source, sentence)
# Quickly exploring our data
# What are the most common words?
tokenized_content |>
  count(Source, sentence) |>
  arrange(-n) |>
  head(5)
##                Source sentence   n
## 1                Time    sleep 544
## 2 Scientific American    sleep 386
## 3                Time     it’s 144
## 4 Scientific American     time 125
## 5                Time   people 106
# Convert this data set into a DOCUMENT FEATURE MATRIX (DFM)
dfm_article <- tokenized_content |>
  pivot_wider(id_cols = c("doc_id", "Source"),
              names_from = "sentence",
              values_from = "sentence",
              values_fn = length,
              values_fill = 0) |>
  select(-sleep) |>
  select(-`it’s`, -`isn’t`,-`don’t`)

# Hierarchical Clustering
# Dendrogram
dfm_article |>
  na.omit() |>
  select(-doc_id, -Source) |>
  scale() |>
  dist() |>
  hclust() |>
  plot()

Below, we are now filtering for words that show up over X amount of times, which we are assigning to our variable good_words. Then, we create our new document feature matrix from the good_words variable, and then a dendrogram from the new DFM. This is to ensure that we are only looking at the important words and to seek out any outliers from that batch of words.

article_word_counts <- dfm_article |>
  select_if(is.numeric) |>
  colSums()

X <- 30

# Take my song_word_counts vector, and only return elements which
# have a value greater than X.
good_words <- names(article_word_counts[which(article_word_counts > X)])

dfm_article1 <- dfm_article |>
  select(Source, all_of(good_words), doc_id)

# New dendrogram with only our good words!
dfm_article1 |>
  select(-doc_id) |>
  dist() |>
  hclust() |>
  plot()
## Warning in dist(select(dfm_article1, -doc_id)): NAs introduced by coercion

## Conducting PCA

We conducted PCA unsupervised and found that some words co-occurred together more than an average amount. PC1 takes all our variables that co-occurred in similar amounts and collapsed them into a single score while maintaining as much information regarding variance as possible. For instance, “people” and “time” have a similar loading/weight within our PC1 so whenever there is the word “people” in an article, “time” is very likely to show up with it. When looking into the paragraphs where the PC1 score is fairly high, these paragraphs more often than not discussed how people sleep.

# PCA - Unsupervised Learning
pca_articles <- dfm_article1 |>
  select(-doc_id, -Source) |>
  prcomp()

# Observing PC1
pca_articles$rotation |>
  data.frame() |>
  select(PC1) |>
  arrange(PC1)
##                      PC1
## noise       -0.032220753
## system      -0.031172459
## magnesium   -0.020546435
## body        -0.020304335
## i’m         -0.016488628
## cells       -0.016241848
## aren’t      -0.014921236
## activity    -0.013382834
## scientists  -0.009585753
## we’re       -0.008925307
## benefits    -0.008236009
## melatonin   -0.007321367
## medicine    -0.003433965
## university   0.002568452
## bedtime      0.006980367
## information  0.008091588
## asleep       0.012502100
## disease      0.014850624
## effects      0.018080731
## blood        0.018114590
## professor    0.019664237
## there’s      0.020501465
## experts      0.020764917
## that’s       0.022443166
## sleeping     0.024358654
## adults       0.024549971
## learning     0.027484119
## including    0.029142299
## evidence     0.031197531
## healthy      0.032173582
## quality      0.033291597
## risk         0.035493704
## you’re       0.039089048
## feel         0.040259815
## didn’t       0.041918564
## doesn’t      0.044148995
## u.s          0.044711713
## studies      0.045351226
## alzheimer’s  0.052856657
## hour         0.059290529
## researchers  0.071680265
## health       0.073467171
## times        0.074798428
## start        0.078562619
## school       0.089240651
## research     0.104846593
## dst          0.110179171
## found        0.122469434
## hours        0.130894959
## night        0.177357575
## study        0.179955410
## bed          0.182870690
## a.m          0.187759063
## people       0.338532648
## time         0.797465708

This lets us look at the PC1 score for each paragraph.

# Related to changes in the brain during sleep
articles_clean |>
  slice(15)
## # A tibble: 1 × 4
##   Source              Content                                   paragraph doc_id
##   <chr>               <chr>                                     <chr>      <int>
## 1 Scientific American "Ah, to sleep, perchance … to shrink you… The rese…     15
# Topic is sleeping on the floor
articles_clean |>
  slice(562)
## # A tibble: 1 × 4
##   Source              Content                                   paragraph doc_id
##   <chr>               <chr>                                     <chr>      <int>
## 1 Scientific American "If lower back pain is keeping you up at… The floo…    562

Supervised Learning

To move on to our supervised learning, we decided to test LDA, kNN, and random forest. First, we mutated the ‘Source’ variable onto the document feature matrix. This way, we can create our train and test data to train our model on and eventually create predictions on our test data. Then, we split our data in half to create two data sets: train and test data. After loading up the caret package, we did a random forest. To tune our random forest, we changed the mtry, the splitrule, and the min.node.size. The mtry is the number of variables that should be considered at each split of the tree. In this case, we provided a range of three mtry values to test out, and the model chose mtry = 3 as the optimal mtry value. This means that our model is trying all of the possible splits with three variables, or words in this case, from our data set. The splitrule determines how the tree should be split along, which we set as “gini”, for less variation in the nodes, and “extratrees” so that the splits are randomly created to have diversity and prevent too much variation in the predictions. The min.node.size is set from a range of 1-3 because it sets the depth of our tree. Our model chose the min.node.size = 1 as the optimal value to create a tree that is complex and has enough depth, but may have more variability. We also set the num.trees to 800 to help with the bias-variance trade-off, as well as making the model stabilize. After 800 trees, the kappa is slightly lower and therefore the model is not really better than random guessing. Finally, We set variable importance to “impurity” to show the variables that have the most influence on the predictions. The variable importance function helped us determine with the older data that there were words such as ‘it’s’, ‘don’t’, etc. that were counted as very important in determining the source from a word. To mitigate this problem, we removed the pronouns that were not helpful using udpipe in the data wrangling from the beginning.

Then, we felt that it would be important to try out kNN and LDA as well, so we used the caret package to perform kNN and LDA. Our models’ performances were not that ideal, both with a lower kappa and accuracy than our original random forest model. For that reason, we decided to scrap it. We created confusion matrices for all of the models as well to see where most of the predictions were concentrated.

Our accuracy was not particularly high, at around 64.50%, and our kappa at .292. We believe that this is the best accuracy and kappa combination possible given this context of predicting sleep articles written by either Time Magazine or Scientific American. It is not necessarily much better than random guessing, but because we picked magazines that were simply related to sleep, even if the subject was not exactly sleep, it makes sense that the accuracy was not perfect. In addition, at times, Time Magazine would write longer articles backed with scientific facts, and Scientific American would sometimes write shorter articles that were more playful and less facts-focused. It isn’t unlikely that there were many common words between the two sources.

# Creating articles_data from our DFM to train and test
articles_data <- dfm_article1 |>
  mutate(Source = factor(Source)) |>
  select(-doc_id) |>
  na.omit()

# Splitting our data in half
training_rows <- sample(1:nrow(articles_data), 
                        size = nrow(articles_data)/2)

train_data <- articles_data[training_rows, ]
test_data <- articles_data[-training_rows, ]

# Setting a specific tunegrid to achieve the
# best kappa and accuracy for our articles
tunegrid <- expand.grid(mtry = c(1, 2, 3),
                                  splitrule = c("gini", "extratrees"),
                                  min.node.size = 1:3)
# Random Forest
article_rf <- train(Source ~.,
                 data = train_data,
                 method = "ranger",
                 tuneGrid = tunegrid,
                 importance = "impurity",
                 num.trees = 800)

# Creating predictions
predictions_rf <- factor(predict(article_rf, newdata = test_data))

# Let's look at variable importance
varImp(article_rf)
## ranger variable importance
## 
##   only 20 most important variables shown (out of 55)
## 
##               Overall
## dst            100.00
## cells           80.23
## hours           69.74
## noise           64.97
## people          60.43
## found           56.89
## bedtime         56.01
## night           50.99
## u.s             50.51
## quality         48.66
## `i’m`           46.44
## time            45.14
## study           39.81
## sleeping        39.73
## experts         38.67
## blood           37.33
## `alzheimer’s`   36.67
## researchers     36.19
## learning        36.08
## research        33.14
# Build a confusion matrix
confusionMatrix(predictions_rf, test_data$Source)
## Confusion Matrix and Statistics
## 
##                      Reference
## Prediction            Scientific American Time
##   Scientific American                 122   63
##   Time                                 55  103
##                                              
##                Accuracy : 0.656              
##                  95% CI : (0.6031, 0.7062)   
##     No Information Rate : 0.516              
##     P-Value [Acc > NIR] : 1.109e-07          
##                                              
##                   Kappa : 0.3102             
##                                              
##  Mcnemar's Test P-Value : 0.5193             
##                                              
##             Sensitivity : 0.6893             
##             Specificity : 0.6205             
##          Pos Pred Value : 0.6595             
##          Neg Pred Value : 0.6519             
##              Prevalence : 0.5160             
##          Detection Rate : 0.3557             
##    Detection Prevalence : 0.5394             
##       Balanced Accuracy : 0.6549             
##                                              
##        'Positive' Class : Scientific American
## 

Here are the other models that we tried to use, but ended up not working as well as our random forest.

# kNN
article_knn <- train(Source ~ ., 
      data = train_data, 
      method = "knn", 
      tuneLength = 5)

# LDA
article_lda <- train(Source ~ ., 
                     data = train_data, 
                     method = "lda",
                     prior = c(0.495, 0.505))

predictions_knn <- factor(predict(article_knn, newdata = test_data))
predictions_lda <- predict(article_lda, newdata = test_data)

confusionMatrix(predictions_knn, test_data$Source)
## Confusion Matrix and Statistics
## 
##                      Reference
## Prediction            Scientific American Time
##   Scientific American                 138   90
##   Time                                 39   76
##                                              
##                Accuracy : 0.6239             
##                  95% CI : (0.5703, 0.6754)   
##     No Information Rate : 0.516              
##     P-Value [Acc > NIR] : 3.647e-05          
##                                              
##                   Kappa : 0.2398             
##                                              
##  Mcnemar's Test P-Value : 1.071e-05          
##                                              
##             Sensitivity : 0.7797             
##             Specificity : 0.4578             
##          Pos Pred Value : 0.6053             
##          Neg Pred Value : 0.6609             
##              Prevalence : 0.5160             
##          Detection Rate : 0.4023             
##    Detection Prevalence : 0.6647             
##       Balanced Accuracy : 0.6187             
##                                              
##        'Positive' Class : Scientific American
## 
confusionMatrix(predictions_lda, test_data$Source)
## Confusion Matrix and Statistics
## 
##                      Reference
## Prediction            Scientific American Time
##   Scientific American                 129   68
##   Time                                 48   98
##                                              
##                Accuracy : 0.6618             
##                  95% CI : (0.609, 0.7117)    
##     No Information Rate : 0.516              
##     P-Value [Acc > NIR] : 3.294e-08          
##                                              
##                   Kappa : 0.3204             
##                                              
##  Mcnemar's Test P-Value : 0.07771            
##                                              
##             Sensitivity : 0.7288             
##             Specificity : 0.5904             
##          Pos Pred Value : 0.6548             
##          Neg Pred Value : 0.6712             
##              Prevalence : 0.5160             
##          Detection Rate : 0.3761             
##    Detection Prevalence : 0.5743             
##       Balanced Accuracy : 0.6596             
##                                              
##        'Positive' Class : Scientific American
## 

Creating the Visualization

We wanted to compare the frequency and importance of certain words between different sources, so the visualization that we settled on was a bar graph showing the frequency of the 10 most important words determined by variable importance, then separated by magazine source (Time or Scientific American) by color. In the diagram, each bar represents a word, with its height corresponding to its frequency in the respective magazine. For example, from the diagram, we can see that the word “researchers” was used more by Scientific American, while “quality” appeared more frequently in Time. This is expected, since Scientific American often focuses on research-driven topics and scientific studies, making terms like “researchers” more relevant to its content. In contrast, Time is catered to broader, more general audiences, which aligns with the higher frequency of a word like “quality”. Interestingly, the word “cells” —a scientific leaning term— is used more by Time, but that might be due to one of the selected articles having a focus on how sleep affects cellular processes.

# CREATING OUR VISUALIZATION
# Convert importance to a data frame
importance <- varImp(article_rf)
varimp_data <- data.frame(importance$importance) |>
  arrange(-Overall)

varimp_data$words <- row.names(varimp_data)   
colnames(varimp_data) <- c("importance", "words")

# Extracting the top 10 most important words
top_words_var <- varimp_data |>
  head(10) 

imp_dfm <- dfm_article1 |>
  pivot_longer(cols = -Source,  # Exclude the 'Document' column from being pivoted
               names_to = "words",  # Column to store words
               values_to = "Count")
# Creating the data we're going to use for
# our visualization
visual_data <- top_words_var |>
  left_join(imp_dfm, by = "words") |>
  group_by(Source, words) |>
  summarize(total_count = sum(Count))
## `summarise()` has grouped output by 'Source'. You can override using the
## `.groups` argument.
visual_data |>
  na.omit() |>
  ggplot() +
  geom_col(aes(x = words, 
               y = total_count, 
               fill = Source), 
           position = "dodge") +
  labs(x = "Most Important Words", y = "Total Count", title = "Frequency of Most Important Words Based on Source") +
  theme_minimal()