Are Web Publications Padding Their Articles For Ad Engagement?
My final project is an investigation into whether online web publications have been needlessly padding the length of their articles in order to increase ad revenue.
Background
The inspiration for this project was a comment thread about a Wired.com article posted on the news aggregation site Hacker News.
The article itself was titled Cars Are Going Electric. What Happens to the Used Batteries?. The title poses a fairly clear question that one would expect could be answered in a short piece of text, yet the article runs to around 2,300 words.
The two main comments are quoted below:
idrisser These articles are insane, 10min in until you get to the part of what happens to used batteries. I’m really thinking the media business model is totally broken for these authors to not go straight to the point
drelau I do wonder sometimes if at least some articles are so long just to be able to fit more ad banners within a single article. I turned off the content blocker and counted 6 ads within the length of this article.
While I also had the feeling that online articles were including large amounts of off-topic information, these two comments put my feelings directly into words.
My analysis, due mostly to time constraints, focuses strictly on attempting to measure whether article length has increased and whether articles have become less focused on their subjects over time.
Data Sources
My primary data source was the online news website Wired.com1. Wired, initially founded as a print magazine, has been operating a technology-focused web publication since 1994. I chose Wired.com for two reasons: one, their entire backlog of articles going back to 1993 is easily accessible via their sitemap, and two, one of their articles was the initial inspiration for this project.
Approach
My approach for this project consisted of the following:
- Use web scraping in R to obtain article data from Wired.com
- Use API calls to NYT Open API to obtain secondary article data
- Tidy the article data
- Store the data in a SQL database for long-term storage and retrieval
- Perform analysis on the article data
Web Scraping With R
Web scraping code can be found here
To scrape the article data from the Wired.com website I primarily used the `rvest` library.
The scraping process was broken into two distinct steps.
Sitemap Scraping
The first step was reading their sitemap, which is conveniently a very simple page containing a list of links from 2021 back to 1992.
Wired.com Sitemap
Each link in the sitemap then directs you to a page that contains links to the actual articles.
Due to the large number of articles, scraping every article would be impractical and would likely result in a block or ban of some sort from Wired.com's team. For that reason I chose to sample a range of articles from each year that Wired.com published articles.
Reading the links from the sitemap only required a simple rvest call, which I have below:
read_html(siteMapUrl) %>%
  html_nodes(".sitemap__section-archive > ul > li > a")

To obtain a reasonable subset of the articles I simply used the sample function on the top-level archive links like so:
sample(archiveLinkNodes, 150)

I then verified that I had at least one link for every year using the following code to extract the year from each URL, list the unique years, and sort them for easier verification:
sampledArchiveLinks %>%
  stringi::stri_extract_all_regex("(?<=year=)[0-9]+") %>%
  unlist() %>%
  unique() %>%
  parse_number() %>%
  sort()

Finally I was left with a simple dataframe which contained a link to an article and the archive source. Here is a sample of what that data looks like for demonstration purposes.
linksDf <- read.csv('final_project_sitemap.csv', nrows=5)
kable(linksDf)

Scraping Article Contents
The next step was to scrape the contents of the articles. Again I used the rvest library to do the work here.
Scraping the article contents involved a more in-depth analysis of the HTML that makes up an article. Luckily Wired.com does not employ any anti-scraping measures and uses a relatively simple structure to display its articles.
For example, to read the body of the article I simply had to select the .body__inner-container class:
articleBody <- html %>%
  html_node(".body__inner-container") %>%
  html_text() %>%
  convertToEmptyString()

To streamline the processing of the articles I created a function which took in the HTML of the article, parsed the contents, and returned a vector containing the contents of the article.
extractArticleContent <- function(html) {
  ...
  c(publishDate, author, subject, byLine, category, articleBody)
}

I identified six distinct parts of each article to read and save (a hypothetical sketch of such an extraction function follows the list):
- Publish date
- Author
- Subject
- By Line
- Category
- Body
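For illustration, here is a hypothetical sketch of what such a function could look like. Only the .body__inner-container selector is taken from the code above; the other selectors are placeholders rather than Wired.com's actual class names, and the real implementation lives in the linked scraping code.

library(rvest)

# Hypothetical sketch only: every selector except .body__inner-container is a
# placeholder, not Wired.com's actual markup.
extractArticleContent <- function(html) {
  grabText <- function(selector) {
    html %>%
      html_node(selector) %>%
      html_text()
  }

  publishDate <- grabText("time")                    # placeholder selector
  author      <- grabText(".byline-author")          # placeholder selector
  subject     <- grabText(".content-header__title")  # placeholder selector
  byLine      <- grabText(".content-header__dek")    # placeholder selector
  category    <- grabText(".article-category")       # placeholder selector
  articleBody <- grabText(".body__inner-container")  # selector shown above

  c(publishDate, author, subject, byLine, category, articleBody)
}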
To avoid hosting Wired.com's article content in my private GitHub repository I did not include any CSVs of the scraped content, but here is a screenshot of the database with some articles selected for demonstration purposes.
Handling Failure & Avoiding Getting Banned
Because web scraping involves a lot of calls over the internet, it has a high potential for failure. In order to keep the process as resilient as possible I used the Try/Catch/Finally feature of the R language to handle failures.
For those not familiar with the Try/Catch/Finally paradigm, you can understand it by breaking it down word by word. First, you put the code you want R to attempt to execute inside a tryCatch block. If there is an error for any reason, your error function will "catch" the error and allow you to do some debugging or rollback with it. The finally function, which is optional, always executes regardless of whether there was an error. This allows you to do things such as closing a database connection or, as in my case, sleeping. While the R language implements Try/Catch/Finally differently than some other languages I am familiar with, the paradigm is still the same.
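Here is a minimal, standalone illustration of the pattern (not the project code itself, which follows below):

result <- tryCatch(
  expr = {
    stop("something went wrong")   # simulate a failed web request
  },
  error = function(e) {
    message("Caught an error: ", conditionMessage(e))
    NA                             # return a fallback value so the script can continue
  },
  finally = {
    message("This always runs, whether or not there was an error")
  }
)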
In addition I utilized my Database to prevent re-processing duplicates by selecting only links which haven’t already been saved in my Database.
The following code is what I used, with some parts removed for brevity:
#Open DB connection
...
#Get list of articles we haven't retrieved yet
links <- dbGetQuery(con, "select lx.link
from LinksXref lx
left join Articles a on a.sourceUrl = lx.link
where a.sourceUrl is null;")
#Create empty DataFrame to store articles with columns that match our DB table
...
#Grab random sample of 100 articles
for(article in sample(links$link, 100)){
  tryCatch(expr = {
    html <- read_html(article)
    data <- extractArticleContent(html)
    # Add data to DF
    ...
  }, error = function(e){
    message(paste("Failed on article ", article))
    message(e)
  }, finally = {
    message(paste("Processed URL:", article))
    # Sleep for a random number of seconds to avoid banning
    Sys.sleep(sample(1:5, 1))
  })
}
#Write articles out to table
dbWriteTable(con,"Articles",articlesDf, append=TRUE)As you can see above I also included a random sleep interval in the finally block to prevent a large number of requests hitting Wired.com’s servers in a short amount of time. The intention is to stay under the radar and prevent setting off any alarms that might be triggered by a high volume of requests from a single IP address source.
In addition to the sleep timer I utilized a paid Virtual Private Network (VPN) service that I already subscribe to in order to randomize my IP address.
Data Storage
Full code for creating DB and tables here
For data storage I ended up using SQLite as my database. I tried using Azure's cloud-hosted SQL storage but unfortunately was unable to connect from R, likely due to my machine's relatively new M1 ARM-based Apple processor.
The database structure was kept very simple. It consists of two tables, for which I have included a relationship model below.
For creating the database and tables, and for querying them, I used the DBI library.
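For illustration, here is a sketch of connecting to the SQLite file and creating the two tables. Only the table names (LinksXref and Articles) and the link/sourceUrl columns appear in the code elsewhere in this post; the database filename and the remaining column definitions are assumptions based on the article fields described above, and the real schema is in the linked code.

library(DBI)
library(RSQLite)

# Connect to (or create) a local SQLite database file (filename is illustrative)
con <- dbConnect(RSQLite::SQLite(), "final_project.db")

# Archive links scraped from the sitemap
dbExecute(con, "
  CREATE TABLE IF NOT EXISTS LinksXref (
    link       TEXT PRIMARY KEY,
    archiveUrl TEXT
  );")

# Scraped article contents (columns mirror the six parts listed earlier)
dbExecute(con, "
  CREATE TABLE IF NOT EXISTS Articles (
    sourceUrl   TEXT PRIMARY KEY,
    publishDate TEXT,
    author      TEXT,
    subject     TEXT,
    byLine      TEXT,
    category    TEXT,
    body        TEXT
  );")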
Here is an example of how I inserted the articles into the Articles table using the dbWriteTable command.
dbWriteTable(con,"Articles",articlesDf, append=TRUE)Data Manipulation/Tidying
Luckily the data I was able to scrape from Wired.com was very tidy and required little manipulation in order to be used for my analysis. I did run into a few areas where I needed to handle some quirks of how the articles were structured, though.
After scraping a large number of articles I looked at my database and realized that I had records with no body but an enormous Category field. After looking into it I found that some articles did not have a category, which caused the remainder of the scraped data to be shifted into the wrong fields.
In order to handle this I added a simple function which converts empty fields to NA:
missingConvert <- function(variable){
  if(length(variable) == 0) NA else variable
}

In addition, I found that some articles were simply devoid of a body altogether due to some publishing error on Wired.com's side or some other reason.
Here is an example screenshot of one such article
To handle this I filtered out of the analysis any articles whose body was empty or contained fewer than 100 words.
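A minimal sketch of that filter, assuming a simple whitespace-based word count and a body column (the names here are illustrative; the actual filter is in the linked analysis code):

library(dplyr)
library(stringr)

# Count whitespace-separated tokens in a block of text
wordCount <- function(text) str_count(text, "\\S+")

articlesFiltered <- wiredArticlesDf %>%
  filter(!is.na(body), wordCount(body) >= 100)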
Analysis
Full analysis source code can be found here
My analysis of the article data was also broken into two steps.
The first step consisted of tokenizing the bodies of the articles into words, then counting, for each year, the total number of words across all articles, the total number of unique words, and the average number of words per article.
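A sketch of that step using tidytext (column names such as body, sourceUrl, and publishYear are the assumed names from my scraped data; in practice this ran on the equal-sized yearly samples described next):

library(dplyr)
library(tidytext)

wordStatsByYear <- wiredArticlesDf %>%
  unnest_tokens(word, body) %>%                    # one row per word per article
  group_by(publishYear) %>%
  summarise(
    totalWords      = n(),                         # all sampled words published that year
    uniqueWords     = n_distinct(word),            # distinct vocabulary for the year
    wordsPerArticle = n() / n_distinct(sourceUrl)  # average article length
  )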
Because of the random nature of my sampling I had an uneven number of articles for a given year; e.g., for 2012 I had 248 articles but for 1996 I had only 10. To control for this I analyzed my data by randomly selecting the same number of articles from each year:
minSampleSize <- wiredArticlesDf %>%
  group_by(publishYear) %>%
  count() %>%
  filter(n >= 20) %>%
  pull(n) %>%   # take the minimum of the per-year counts, not of the whole data frame
  min()

articlesGrouped <- wiredArticlesDf %>%
  group_by(publishYear) %>%
  #filter(n() > minSampleSize) %>%
  slice_sample(n = minSampleSize)

The second part of my analysis utilized the natural language processing technique known as Latent Dirichlet allocation (LDA). Due to my unfamiliarity with LDA I borrowed heavily from the Text Mining with R2 book.
According to the Text Mining with R book, LDA is one of the most common topic modeling algorithms in use today. My theory was that I could use LDA to see how well the words contained in each article matched the algorithm's automatically generated topics: the better the match to the topics, the more on-topic the article.
As you increase the number of articles and corresponding topics, the visualizations become less valuable, as you can see with this graph of just 20 articles and topics.
To attempt to quantify this on-topic-ness over time, I ran LDA on a sample set of articles grouped by category over the years and saved off the number of words that were assigned to the wrong article. I then graphed the mean count of those misassigned words over time. I experimented with the sample size but settled on 10 articles and topics and 10 iterations.
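The sketch below loosely follows the "great library heist" example from Text Mining with R: fit an LDA model, assign each word occurrence to its most likely topic with augment(), and count the words whose topic disagrees with their article's consensus topic. The data frame and column names (articleSample, body, sourceUrl) are assumptions, and my actual code differs in its details.

library(dplyr)
library(tidytext)
library(topicmodels)

# Cast the sampled articles into a document-term matrix (one document per article)
articleDtm <- articleSample %>%
  unnest_tokens(word, body) %>%
  anti_join(stop_words, by = "word") %>%
  count(sourceUrl, word) %>%
  cast_dtm(sourceUrl, word, n)

# Fit LDA with as many topics as sampled articles
articleLda <- LDA(articleDtm, k = 10, control = list(seed = 1234))

# Assign each word occurrence in each document to its most likely topic
assignments <- augment(articleLda, data = articleDtm)

# The "consensus" topic for an article is the topic holding most of its words
consensus <- assignments %>%
  count(document, .topic, wt = count) %>%
  group_by(document) %>%
  slice_max(order_by = n, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(document, consensusTopic = .topic)

# Words assigned to a topic other than their article's consensus topic
wrongWords <- assignments %>%
  inner_join(consensus, by = "document") %>%
  filter(.topic != consensusTopic) %>%
  summarise(misassigned = sum(count))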
Conclusion & Reflections
Using the data I was able to obtain from Wired.com, I believe I provided solid evidence that articles have been getting longer over time. All three of the word count metrics I used showed a clear upward trend.
While my use of LDA did not provide evidence quite as solid, I believe the results at least point in a promising direction and warrant further investigation.
There is some external evidence to support my conclusions as well. Numerous articles point out the fall in print revenue, such as Digital Revenue Exceeds Print for 1st Time for New York Times Company and this Forbes article, From Scams To Mainstream Headlines, Clickbait Is On The Rise.
Below is a list of things I would like to look into further or change:
- Using a separate data source
- My intention was to use the New York Times Open API as a secondary data source but I had to forgo it in the interest of time.
- Dive deeper into LDA and investigate whether its application was appropriate in this context
- Research other methods of determining how on-topic an article is relative to its subject
- Do more filtering by category and fine tuning of sample sizes
- Collect more data to improve the analysis
- I was only able to collect a small subset of the articles that Wired.com has posted on their website. Collecting more data would improve the quality of my analysis.
- Vet the data more closely
- Due to the time constraints and the inherently messy nature of web scraping it is highly likely I have more bad data than I was able to identify in my analysis.
References
Nast, C. (n.d.). The latest in technology, science, culture and business. Wired. Retrieved December 5, 2021, from https://www.wired.com/
Silge, J., & Robinson, D. (2017). Text mining with R: A tidy approach. O'Reilly Media.