Introduction

The internet is doubling in size every 2 years, and with it the volume of content such as news, media, and ebooks keeps growing. In our day-to-day lives, the challenge is therefore not the availability of data but sifting through it to find useful and relevant information. For this project, I chose to automatically summarize an online article by Dr. Natalia Mehlman Petrzela (https://nataliapetrzela.com) on the history of home exercise, titled “Get in Shape Girl: A Century of Working Out from Home”, and then to read the article to see how well the summary captured its content.

Text summarization is an application of Natural Language Processing (NLP) that uses many of the techniques and algorithms explored in this class. Exploration of automatic summarization started in the 1940s during World War II (Prasasthy, 2022) but did not gain popularity until the 1980s, when NLP and AI started taking off. There are two main categories of text summarization: extraction and abstraction (Joshi, 2020). Extractive methods select existing sentences from the source text, while abstractive methods generate new sentences that paraphrase it.

Because extractive methods of summarization rely on word frequency and other language modeling techniques to weight sentences and words (Erkan & Radev, 2004), I will use the extractive approach for this project.

Extractive summarization is done through three main steps (Allahyari et al., 2017):

  1. Build an intermediate representation of the text to summarize, based on chosen attributes.
  2. Score the sentences based on that representation.
  3. Select a summary comprising a number of sentences.
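
The three steps above can be sketched on a toy text using only the Python standard library. This is a minimal illustration and not the function used later in this report: the stop-word list, the naive sentence splitter, and the names `split_sentences`, `tokenize`, and `toy_summary` are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "it", "on"}

def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    # Lowercase words with stop words removed.
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP_WORDS]

def toy_summary(text, limit):
    sentences = split_sentences(text)
    # Step 1: intermediate representation = word counts over the whole text.
    counts = Counter(w for s in sentences for w in tokenize(s))
    # Step 2: score each sentence as the sum of the counts of its words.
    scores = [sum(counts[w] for w in tokenize(s)) for s in sentences]
    # Step 3: select the `limit` highest-scoring sentences, best first.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in ranked[:limit]]

print(toy_summary("Cats sleep. Cats and dogs play. Birds sing.", 2))
```

Here "cats" is the most frequent word, so the two sentences that mention cats outrank the one about birds.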

Research Question

The goal of this analysis is to test how accurately extractive summarization methods can summarize the chosen article, and therefore whether the produced summary can be relied on to retrieve the most important information in a document, which would ease the consumption of online media.

Method

I chose to summarize the article “Get in Shape Girl: A Century of Working Out from Home” by Dr. Natalia Mehlman Petrzela. It was retrieved from Jezebel.com, where it was published on May 20th, 2020.

Link to the article: https://jezebel.com/get-in-shape-girl-a-century-of-working-out-from-home-1843416457

Data will be collected by scraping the contents of the article from jezebel.com using rvest in R, then transferring it to Python, where I will conduct the analysis. According to Ken (2021), a summary should contain between 6 and 8 sentences. Given that this article is longer than regular articles (106 sentences in total), I believe that 8 sentences will help me capture its most important content.

As mentioned earlier, extractive summarization will be used for this project: the article’s content will be tokenized, its sentences will be ranked based on word importance and frequency, and only the top 8 sentences will be retained.
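
The ranking can be written out as a formula. This is a sketch of the scoring performed by the summarization function defined later in this report; the symbols $c(w)$, $\hat{f}(w)$, and $\mathrm{score}(s)$ are notation introduced here for illustration.

```latex
\hat{f}(w) = \frac{c(w)}{\max_{w'} c(w')}, \qquad
\mathrm{score}(s) = \sum_{t \in s} \hat{f}(t)
```

Here $c(w)$ is the number of times the retained word $w$ occurs in the article, and $\hat{f}(t) = 0$ for tokens that were filtered out (stop words, punctuation, and other parts of speech). For example, if the most frequent word occurs 10 times, a word that occurs 4 times gets a weight of 0.4. Because the score is a sum rather than an average, longer sentences tend to score higher.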

After the automatic summary is generated, I will read the article to see whether the summary captured the most important ideas well and what improvements could be made to the algorithm in the future.

Analysis

The analysis will follow these steps:

  - Data import using rvest
  - Transfer to Python and conversion from a dataframe to a single combined str
  - Tokenizing the article’s content using spaCy
  - Calculating word frequency to identify the most frequent words
  - Using word frequency to rank sentences
  - Choosing the top 8 sentences to form the summary

Libraries

# R
library(rvest)       # web scraping
library(reticulate)  # running Python from R and sharing objects between them
# Python
import pandas as pd
import nltk
import string
import spacy
from collections import Counter #for word frequency
from string import punctuation #to clean text from punctuation
from heapq import nlargest
nlp = spacy.load("en_core_web_sm") #small English model for speed

Loading article

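# R: scrape the text of every paragraph (<p>) element of the article
text <- read_html("https://jezebel.com/get-in-shape-girl-a-century-of-working-out-from-home-1843416457") %>% html_nodes("p") %>% html_text()

# R: store the paragraphs in a one-column data frame named "text"
text <- data.frame(text = text, stringsAsFactors = FALSE)
# Python: read the R data frame through reticulate and join the paragraphs into one string
article = r.text
article = pd.DataFrame.from_dict(article)
article = ' '.join(article.text)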

Creating the summarization function

def ext_summary(text, limit):
  # Keep only content words (by part of speech), skipping stop words and punctuation.
  pos_tag = ['PROPN', 'ADJ', 'NOUN', 'VERB']
  clean = nlp(text.lower())
  tagger = []
  for token in clean:
    if token.text in nlp.Defaults.stop_words or token.text in punctuation:
      continue
    if token.pos_ in pos_tag:
      tagger.append(token.text)

  # Normalize each word count by the count of the most frequent word.
  word_frequencies = Counter(tagger)
  max_freq = word_frequencies.most_common(1)[0][1]
  for w in word_frequencies:
    word_frequencies[w] = word_frequencies[w] / max_freq

  # Score each sentence as the sum of the normalized frequencies of its words.
  top_sentences = {}
  for sent in clean.sents:
    for word in sent:
      if word.text in word_frequencies:
        top_sentences[sent] = top_sentences.get(sent, 0) + word_frequencies[word.text]

  # Keep the `limit` highest-scoring sentences, best first.
  sentence_scores = sorted(top_sentences.items(), key=lambda kv: kv[1], reverse=True)
  summary = [str(sent).capitalize() for sent, score in sentence_scores[:limit]]
  return ' '.join(summary)

Generate the Summary

print("Below is a summary of the article:\n 'Get in Shape Girl: A Century of Working Out from Home' \n by Dr. Natalia Mehlman Petrzela\n")
## Below is a summary of the article:
##  'Get in Shape Girl: A Century of Working Out from Home' 
##  by Dr. Natalia Mehlman Petrzela
print(ext_summary(article, 8))
## Home exercise is a multi-billion dollar industry with roots a century old, but it’s not so surprising it took being  housebound for him to “discover” it, or that my preferred platform, obé fitness, has a pastel-pink aesthetic: working out from home has for decades been marketed mostly to women assumed both to spend more time in the house and more energy on their appearance. For a generation of women taught that exercise was unladylike and even dangerous, lalanne’s thirty-minute, black-and-white show made the radical proposition of claiming that women should make time for exercise, for their own sake. In a moment when tv married couples slept in separate beds and toilets were conspicuously absent from bathrooms, the unapologetic physicality of exercise television could understandably be titillating: in 1972, cosmopolitan reported on a swinging chattanooga couple who invited friends to their home to watch the television show yoga for health as foreplay, after which “the slide into group sex, southern style, came easily enough.” For over a century before the much-debated peloton and its offshoots grandly proclaimed to “democratize” fitness through digital technology, americans have been enticed to exercise at home through products that appeal to our desire not only for personal transformation, but for perpetual productivity and privacy. Her exercise album how to keep your husband happy made clear fitness would help women allay their insecurities, but not by challenging the idea their self-worth was predicated on their desirability to men. Men were aggressively marketed at-home fitness in the 1920s, when the whole idea of “purposive exercise” began to take hold. “fitness centers muscling into the home,” read a typical headline announcing this architectural innovation allowing homeowners to “avoid the crowds and intimidation” of irl gyms, and to literally build their disciplined commitment to fitness into domestic life. 
If men had to be aggressively convinced that they should exercise, fitness was far more socially acceptable for women, especially if it was in pursuit of beauty—an acceptable feminine aspiration—not a gateway to brutish sport.

Discussion

My first observation on the generated summary is that all 8 sentences are combined into one giant paragraph. An improvement would be to list them in the order in which they appear in the original article, with line and paragraph breaks formed accordingly, to make the summary easy and pleasant to read.

The summary starts with a sentence referring to someone as “him”, but the summary alone does not tell us who this is. After reading the article, I can deduce that the author is referring to her husband. As I continued reading the article, it became clear that the sentences need to be arranged in the same order as in the original. This preserves the author’s intended flow and the logic behind the text being summarized, which is even more important for this article because it discusses the history of home exercise.
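
One way this reordering could be done is sketched below, as a possible future improvement rather than part of the current function. It assumes the sentences and their scores are available as a list of pairs in original article order; the name `select_in_document_order` and the sample scores are hypothetical.

```python
def select_in_document_order(scored_sentences, limit):
    """Pick the `limit` best sentences, then restore their original order.

    scored_sentences: list of (sentence, score) pairs in article order (assumption).
    """
    # Rank positions by score; Python's sort is stable, so ties keep the earlier sentence first.
    ranked = sorted(range(len(scored_sentences)),
                    key=lambda i: scored_sentences[i][1], reverse=True)
    # Sorting the kept positions puts the sentences back in article order.
    keep = sorted(ranked[:limit])
    return [scored_sentences[i][0] for i in keep]

# Made-up scores for four sentences A-D, listed in article order.
scored = [("A", 0.2), ("B", 0.9), ("C", 0.5), ("D", 0.7)]
print(select_in_document_order(scored, 3))
```

With these scores, the three best sentences are B, D, and C, and they are returned as B, C, D. Each item could also be printed on its own line to address the single-paragraph problem noted above.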

Using extractive summarization to gain a quick understanding of an article is very helpful, especially when given the option of choosing how many sentences to retain. The same word-frequency approach could also be used to inspire headlines based on the most used words in a text. This project was my first exploration of this method, and improvements are still needed.
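
As a rough sketch of the headline idea, the most used words in a text can be listed with `collections.Counter`. The stop-word list, the function name `top_keywords`, and the sample text are assumptions made for this example; a fuller version could reuse the part-of-speech filtering from the summarization function.

```python
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "at", "for"}

def top_keywords(text, k):
    # Lowercase the text, keep alphabetic words, and drop stop words.
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]
    # most_common returns (word, count) pairs sorted by descending count.
    return [word for word, count in Counter(words).most_common(k)]

sample = "Home exercise is big. Exercise at home is for women. Home fitness grows."
print(top_keywords(sample, 2))
```

For the sample text this lists "home" and "exercise", which could serve as starting points for a headline.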

References

Allahyari, M., Pouriyeh, S., Assefi, M., Safaei, S., Trippe, E. D., Gutierrez, J. B., & Kochut, K. (2017). Text Summarization Techniques: A Brief Survey. International Journal of Advanced Computer Science and Applications, 8(10). https://doi.org/10.14569/ijacsa.2017.081052

Erkan, G., & Radev, D. R. (2004). LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. Journal of Artificial Intelligence Research, 22, 457–479. https://doi.org/10.1613/jair.1523

Foong, N. W. (2021, December 13). Extractive Text Summarization Using spaCy in Python. Medium. https://betterprogramming.pub/extractive-text-summarization-using-spacy-in-python-88ab96d1fd97

Gonçalves, L. (2021, December 14). Automatic Text Summarization Made Simple | luisfredgs. Medium. https://medium.com/luisfredgs/automatic-text-summarization-made-simple-with-python-f9c3c645e34a

Joshi, P. (2020, December 23). Automatic Text Summarization Using TextRank Algorithm. Analytics Vidhya. https://www.analyticsvidhya.com/blog/2018/11/introduction-text-summarization-textrank-python/

Ken, S. M. W. (2021, September 29). How to Start a Summary Paragraph. wikiHow. https://www.wikihow.com/Start-a-Summary-Paragraph

Prasasthy K. B. (2022, January 6). Brief history of Text Summarization - Prasasthy K B. Medium. https://medium.com/@prasasthy.sanal/brief-history-of-text-summarization-9d1b3787a707