Introduction

In this post we compare headlines scraped from CalMatters and Capitol Weekly. We compare authors, look at the topics of each site, compare which days of the week each site favors for publishing, and compare word usage in the headlines. One of the more interesting findings is that Capitol Weekly seems to favor headlines focusing on the Legislature, while CalMatters mentions the Governor in its headlines more often.

Data Collection

The data was scraped from both the CalMatters and Capitol Weekly websites on October 8th, 2023. For CalMatters, headlines and info for all stories from each subcategory (Politics, Justice, Economy, Education, Housing, Environment, Health, Inequality, and Commentary) were scraped. For Capitol Weekly, headlines and info were scraped for all stories in the Opinion section and for most stories in the News section. One page of News stories was not scraped, but it consisted only of stories from 2010, so it doesn’t affect this analysis, which covers 2023.

The headline, author, date, link and topic/section were collected for each story. The topics are not consistent between the two websites. For CalMatters, the topics are the same as their subcategories (Politics, Justice, Economy, Education, Housing, Environment, Health, Inequality, and Commentary). For Capitol Weekly, the topics match most of the “Features” on their site (Analysis, Podcast, Opinion, News, Micheli Files, Experts Expound, The Skinny, Photo, Rising Stars, Reporter’s Notebook, Recent News, Big Daddy).

For this analysis, the data was filtered to include only headlines from stories published so far in 2023 (up to the scrape date of October 8th) for each site. This leaves 742 headlines from CalMatters and 338 headlines from Capitol Weekly.
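The year filter described above can be sketched in a few lines of Python. This is not the code used for the post; the record layout (`site`, `headline`, `date`) and the example rows are made up for illustration.

```python
from datetime import date

# Made-up example rows; the field names are assumptions, not the post's schema.
stories = [
    {"site": "CalMatters", "headline": "Example headline A", "date": date(2023, 3, 14)},
    {"site": "CalMatters", "headline": "Example headline B", "date": date(2022, 11, 2)},
    {"site": "Capitol Weekly", "headline": "Example headline C", "date": date(2023, 1, 9)},
]


def published_in_year(records, year):
    """Keep only the records whose publication date falls in the given year."""
    return [record for record in records if record["date"].year == year]


stories_2023 = published_in_year(stories, 2023)
print(len(stories_2023), "of", len(stories), "example stories are from 2023")
```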

Authors

Let’s start with a look at the authors. First, let’s look at how many authors each site had.

Site            Authors
CalMatters      87
Capitol Weekly  144

Despite having fewer stories published, Capitol Weekly has about 1.7 times as many authors as CalMatters (144 vs. 87).
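For anyone who wants to reproduce a count like this, here is a rough Python sketch. It assumes each story is a record with `site` and `author` fields (hypothetical names) and counts distinct author strings, so a shared byline such as "A and B" would count as its own author. The rows are invented.

```python
from collections import defaultdict

# Invented rows for illustration; "site" and "author" are assumed field names.
stories = [
    {"site": "CalMatters", "author": "Author One"},
    {"site": "CalMatters", "author": "Author One"},
    {"site": "CalMatters", "author": "Author Two"},
    {"site": "Capitol Weekly", "author": "Author Three"},
]


def distinct_authors_by_site(records):
    """Return {site: number of distinct author strings seen for that site}."""
    authors = defaultdict(set)
    for record in records:
        authors[record["site"]].add(record["author"])
    return {site: len(names) for site, names in authors.items()}


author_counts = distinct_authors_by_site(stories)
print(author_counts)

# Ratio of the author counts reported in the table above.
ratio = 144 / 87
print(round(ratio, 2))
```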

Now let’s take a look at the most common authors for each site.

It looks like “Guest Commentary” is one of the most common authors for CalMatters. Let’s take out Commentary and Opinion pieces and see who the most common authors are without those categories.

It looks like that changes up the CalMatters authors in particular, taking out both of their top two authors.
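A ranking like this, with and without the Commentary and Opinion categories, could be sketched as follows. The `topic` field, the helper name `top_authors`, and the rows are all hypothetical; this is only an illustration of the idea, not the post's code.

```python
from collections import Counter

# Invented rows; "site", "author" and "topic" are assumed field names.
stories = [
    {"site": "CalMatters", "author": "Guest Commentary", "topic": "Commentary"},
    {"site": "CalMatters", "author": "Guest Commentary", "topic": "Commentary"},
    {"site": "CalMatters", "author": "Guest Commentary", "topic": "Commentary"},
    {"site": "CalMatters", "author": "Reporter A", "topic": "Politics"},
    {"site": "CalMatters", "author": "Reporter A", "topic": "Housing"},
    {"site": "CalMatters", "author": "Reporter B", "topic": "Health"},
]


def top_authors(records, site, exclude_topics=(), n=10):
    """Rank a site's authors by story count, skipping the excluded topics."""
    counts = Counter(
        record["author"]
        for record in records
        if record["site"] == site and record["topic"] not in exclude_topics
    )
    return counts.most_common(n)


with_commentary = top_authors(stories, "CalMatters")
without_commentary = top_authors(
    stories, "CalMatters", exclude_topics=("Commentary", "Opinion")
)
print(with_commentary)
print(without_commentary)
```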

Interestingly, there doesn’t appear to be any overlap in authors between the two sites, but this could be due to differences in how each site formats its bylines.
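One way to probe the formatting caveat would be to normalize the author strings before comparing them. The rules below (lowercasing, dropping a leading "By", stripping punctuation) are my own guesses at plausible differences, not something checked against either site, and the names are placeholders.

```python
import re


def normalize_author(name):
    """Lowercase, drop a leading 'by', strip punctuation, collapse whitespace."""
    name = name.strip().lower()
    name = re.sub(r"^by\s+", "", name)
    name = re.sub(r"[^\w\s]", "", name)
    return re.sub(r"\s+", " ", name).strip()


def shared_authors(names_a, names_b):
    """Return the normalized author names that appear in both collections."""
    return {normalize_author(n) for n in names_a} & {normalize_author(n) for n in names_b}


# Placeholder bylines, formatted differently on purpose.
overlap = shared_authors(["By Jane Q. Example"], ["JANE Q EXAMPLE ", "Someone Else"])
print(overlap)
```

If the overlap is still empty after a pass like this, the lack of shared authors is more likely to be real.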

Topics

This will not be a comparison of topics between the two sites, since the topics collected were different for each site. For Capitol Weekly the topics are more like their different features, whereas CalMatters’ topics fall more along the lines of what we would expect. Because of this, we will first look at CalMatters’ topics and then Capitol Weekly’s, without comparing them side by side.

CalMatters

Alright, let’s start with a look at CalMatters’ topics. We noted earlier that they have 742 headlines for 2023 as of our analysis. Let’s see how many were in each topic.

It looks like Commentary is by far the most common topic on CalMatters. We’ve included the plot again with Commentary removed to better show the other categories, which are more closely spaced. Interestingly, the Economy and Justice topics have noticeably fewer stories than the other topics.

Capitol Weekly

Now let’s take a look at Capitol Weekly.

Opinion pieces also take the lead at Capitol Weekly, but not by as much as at CalMatters. Capitol Weekly also has a longer tail of features with fewer stories, such as The Skinny, Rising Stars, Reporter’s Notebook, and Micheli Files.

Days of the week

When do these news sites publish most of their work? Let’s take a look at which day of the week gets the most action!

It looks like CalMatters favors the middle of the week, while Capitol Weekly publishes most of its work at the start of the week.
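Tallying stories by day of the week only needs the publication dates. Here is a small Python sketch with made-up dates; the helper name is hypothetical and this is not the code behind the plots.

```python
from collections import Counter
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def stories_per_weekday(dates):
    """Count how many dates fall on each weekday, in Monday-to-Sunday order."""
    counts = Counter(d.weekday() for d in dates)  # date.weekday(): Monday is 0
    return {WEEKDAYS[i]: counts.get(i, 0) for i in range(7)}


# Made-up publication dates for illustration.
example_dates = [date(2023, 10, 9), date(2023, 10, 11), date(2023, 10, 11)]
per_day = stories_per_weekday(example_dates)
print(per_day)
```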

The headlines

Alright, let’s analyze the actual headlines themselves now. Are there any significant differences between these sites in what they cover?

We exclude common words (“for”, “the”, etc.) as well as names for the state (“california”, “ca”, etc.). Let’s see what the most common remaining words in the headlines are.
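As a rough illustration of this step, the sketch below splits headlines into lowercase words, drops a stop-word list and the state terms, and counts what is left. The tiny stop-word set, the tokenizing regex, and the `extra_drop` parameter are my own assumptions; the post's actual word lists are not shown, and a real analysis would use a fuller stop-word lexicon.

```python
import re
from collections import Counter

# Deliberately tiny, illustrative lists; not the lists used in the post.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}
STATE_TERMS = {"california", "california's", "ca"}


def tokenize(headline):
    """Lowercase a headline and split it into words, keeping possessives whole."""
    text = headline.lower().replace("\u2019", "'")  # treat curly apostrophes as straight
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text)


def top_words(headlines, extra_drop=(), n=10):
    """Most common words after removing stop words, state terms and extra_drop."""
    drop = STOP_WORDS | STATE_TERMS | set(extra_drop)
    counts = Counter(
        word for headline in headlines for word in tokenize(headline) if word not in drop
    )
    return counts.most_common(n)


# Made-up headlines for illustration.
example_headlines = ["California\u2019s budget crisis", "The budget for the Senate"]
print(top_words(example_headlines, n=3))
```

The `extra_drop` argument is there so that site-specific terms can be excluded in a second pass without editing the base lists.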

It looks like Capitol Weekly’s headlines include a lot of words from the names of their features, such as “rising” and “stars”. They also seem to have covered their own Capitol Weekly Top 100 a lot. Let’s remove these terms.

It looks like CalMatters includes Gavin Newsom in their headlines a lot more than Capitol Weekly, which seems to focus much more on the legislature and the legislative process. CalMatters also uses “crisis”, while no word of comparable urgency shows up in Capitol Weekly’s top words, though “action” does.

How much correlation is there between the words in Capitol Weekly’s headlines and CalMatters’ headlines? With both sites covering California news, particularly California politics, we might expect to see significant correlation, but let’s check. For this, we won’t remove “california” terms or Capitol Weekly’s features.
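A common way to set up this kind of comparison is to turn each site's word counts into proportions of that site's total and then pair up the words that both sites use. I am assuming that is roughly what happens here; the sketch below shows the idea on made-up tokens, with hypothetical helper names.

```python
from collections import Counter


def word_proportions(tokens):
    """Map each word to its share of all tokens from one site."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def paired_proportions(tokens_a, tokens_b):
    """Return (word, share_a, share_b) for the words that appear on both sites."""
    shares_a = word_proportions(tokens_a)
    shares_b = word_proportions(tokens_b)
    shared = sorted(shares_a.keys() & shares_b.keys())
    return [(word, shares_a[word], shares_b[word]) for word in shared]


# Made-up tokens standing in for each site's headline words.
site_a = ["senate", "budget", "budget", "newsom"]
site_b = ["senate", "senate", "budget", "rising"]
pairs = paired_proportions(site_a, site_b)
print(pairs)
```

Using proportions rather than raw counts keeps the comparison fair when one site has many more headlines than the other.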

It appears from the graph above that there is a fair amount of correlation among the terms used in the headlines of these two sites.

We can see on the Capitol Weekly side there are some words that don’t correlate well with CalMatters headlines, such as “rising” and “100”. As mentioned before, these terms are specific to features or awards given by Capitol Weekly so it makes sense that they would be outliers.

However, we generally see a good amount of correlation, with “California” and “California’s” up in the top corner as the words most used by both sites. Towards the middle/lower right corner, where the bulk of the words are, we see common terms like “homeless”, “safety”, “senate”, and “uc” being used pretty close to evenly by both sites.

## 
##  Pearson's product-moment correlation
## 
## data:  CalMatters and Capitol Weekly
## t = 28.223, df = 475, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7553772 0.8227949
## sample estimates:
##       cor 
## 0.7914814

Here we get the Pearson correlation coefficient of the terms in the headlines of these sites, and we can see that it’s ~0.79, with a 95 percent confidence interval of roughly 0.76 to 0.82 and a p-value below 2.2e-16. That is a high and statistically significant correlation, suggesting that these sites generally cover similar topics and use similar words in their headlines.
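As a sanity check on the output above, the t statistic and the confidence interval can be recomputed from the reported correlation and degrees of freedom alone. The sketch below uses the standard relation t = r * sqrt(df / (1 - r^2)) and a Fisher z-transform interval, which is how R's `cor.test` documents its confidence interval. The function names are mine, and `pearson_r` is included only to show the definition on toy data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r, df):
    """t statistic for testing whether a correlation differs from zero."""
    return r * math.sqrt(df / (1 - r * r))


def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% confidence interval via the Fisher z-transform."""
    z = math.atanh(r)
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)


# Values taken from the test output above; n = df + 2 word pairs.
r, df = 0.7914814, 475
t = t_statistic(r, df)
low, high = fisher_ci(r, df + 2)
print(round(t, 3), round(low, 4), round(high, 4))
```

The recomputed values should agree with the reported t of 28.223 and the interval of 0.755 to 0.823 to within rounding.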

Summary

In closing, we found that despite having fewer stories published this year, Capitol Weekly published stories by more authors than CalMatters. While the topics scraped from these two sites aren’t perfectly comparable, both sites seem to favor Commentary or Opinion pieces. While Capitol Weekly publishes most of their stories at the start of the week, CalMatters publishes most of their stories in the middle of the week. The words in the two sites’ headlines are significantly correlated, but CalMatters seems to favor headlines about the Governor while Capitol Weekly spends more of their headline space mentioning the legislature.

Join me next week, when I might perform sentiment analysis on the Commentary/Opinion pieces from these sites for 2023 and compare the results. Maybe. We’ll see.