Animal Refuge League of Greater Portland (ARLGP) News Stories

I used web scraping to learn more about the news outlets and stories that the ARLGP features on its website (found under the Press section of the News & Events tab).

I chose this because my background is in Animal Science and I was curious which outlets the ARLGP partnered with most often on news stories. Keeping ethical data scraping in mind, I thought this was acceptable for a few reasons:

Data Scraped

I collected two sets of values for my purposes, listed in the following code chunk:

# Hyperlinked article text (news outlet plus story title)
news_title <- arlgp_news %>% html_nodes("h4+ p a") %>% html_text()
# Bolded day-of-week and date text
dayofweek <- arlgp_news %>% html_nodes("h4+ p > strong") %>% html_text()
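These selectors assume the press page has already been read into arlgp_news and that the necessary packages are loaded. A minimal setup might look like the sketch below; the press page URL shown is an assumption and would need to match the live site.

library(rvest)     # read_html(), html_nodes(), html_text()
library(dplyr)     # %>%, group_by(), summarize(), slice(), top_n()
library(tidyr)     # drop_na()
library(stringr)   # str_replace_all()
library(reshape2)  # colsplit()
library(forcats)   # fct_reorder()
library(ggplot2)   # plotting
library(knitr)     # kable()

# Assumed location of the Press section of the News & Events tab
arlgp_news <- read_html("https://arlgp.org/news-events/press/")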

The first data set contained the hyperlinked articles on the web page, which included both the news outlet that ran the story and the title of the story itself. I split this into two columns: one for the news outlet and one for the article title.

# Split "Outlet: Title" at the first ": " into two columns (colsplit() is from reshape2, drop_na() from tidyr)
news_story <- colsplit(news_title, ": ", names = c("Outlet", "Story_Name")) %>% drop_na()
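To illustrate how this split works (assuming the hyperlink text follows an "Outlet: Title" pattern), a single scraped headline behaves like this:

# Example string only, taken from one of the scraped headlines
example <- "WGME: The importance of foster families during the pandemic"
colsplit(example, ": ", names = c("Outlet", "Story_Name"))
# Returns a one-row data frame with Outlet = "WGME" and
# Story_Name = "The importance of foster families during the pandemic"

Because only the first ": " acts as the break point, a title that itself contains a colon (such as the Forbes story below) stays intact in the Story_Name column.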

The other data set contained the day of the week and the date in m/d/yy format. As with the first data set, I wanted to separate the day of the week from the date to make the table easier to navigate.

# Remove the trailing colons from the scraped text
dayofweek <- dayofweek %>% str_replace_all("\\:", "")
# Split "Day, Date" into separate Day_of_Week and Date columns
day_and_time <- colsplit(dayofweek, "\\, ", names = c("Day_of_Week", "Date"))
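At this point the Date column is still plain text. The analysis below does not depend on it, but if chronological sorting were ever needed, the m/d/yy strings could be parsed into proper Date objects; this is just an optional sketch using lubridate:

library(lubridate)

# mdy() parses strings such as "9/19/20" into Date objects (2020-09-19)
day_and_time$Date <- mdy(day_and_time$Date)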

Finally, I combined the two data sets into one final table that provided the day of the week, date, news outlet, and title for each news story posted to the page. This is illustrated in the table below.

Table: ARLGP News Features

# Combine the two data sets and drop specific rows by position
news_table <- cbind(day_and_time, news_story) %>% slice(-c(62:63, 129:132, 136:137, 142, 159, 163:165, 169, 171, 178, 180, 190))
# Display 20 rows of the table with kable()
news_table %>% top_n(20) %>% kable()
## Selecting by Story_Name
| Day_of_Week | Date | Outlet | Story_Name |
|---|---|---|---|
| Saturday | 9/19/20 | News Center Maine | Young Mainer bouncing for 24 hours to raise money for the ARLGP |
| Thursday | 7/23/20 | WGME | The importance of foster families during the pandemic |
| Thursday | 5/14/20 | Maine Public | Veterinary news |
| Friday | 4/24/20 | Portland Press Herald | To lift spirits, ARL holds ‘hootenanny’ drive-by parade |
| Wednesday | 2/26/20 | News Center Maine | The many benefits of puppy yoga |
| Monday | 2/3/20 | News Center Maine | The sweet sounds of a violin make it feel more like home |
| Sunday | 2/2/20 | News Center Maine | The winner of Puppy Bowl XVI is Team Fluff with the help of Maine Collie |
| Tuesday | 12/24/19 | WMTW | Westbrook house shows off lights for good cause |
| Monday | 10/14/19 | WGME | Terminally ill dog adopted, ready for ‘bucket list’ adventures |
| Wednesday | 8/21/19 | WMTW | Some of 100 animals seized from Maine home ready for adoption |
| Monday | 6/3/19 | American Journal | Together Days revisits 1980s |
| Tuesday | 4/23/19 | Forbes | The Maine Thing: How This Inn by the Sea Can Seduce You |
| Wednesday | 3/13/19 | The Forecaster | USM’s ‘Applause for Paws’ concert inspired by rescue animals |
| Saturday | 8/25/18 | NECN | We loved him right from day one: Maine family adopts dog after fostering |
| Sunday | 7/8/18 | News Center Maine | Tips to keep your pet from running away during fireworks |
| Tuesday | 7/3/18 | Kennebec Journal | Westbrook neighborhood’s wandering cat Simba, a high school regular, dies |
| Friday | 2/2/18 | American Journal | Westbrook puppy touches down on national stage |
| Friday | 10/13/17 | WGME | Traveling dog wash boosts local animal shelter |
| Thursday | 6/15/17 | Keep Me Current | TD Affinity Membership Program |
| Thursday | 4/6/17 | Keep Me Current | Tees get A+ as animal shelter fundraiser |
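One note on the code that produced this table: because top_n(20) was not given a column to rank by, it fell back to the last column, which is why the console reported "Selecting by Story_Name". If the intent is simply to display the first 20 rows in their existing order, something like the sketch below would make that explicit (shown here for comparison, not what was run above):

# Take the first 20 rows as-is instead of ranking by a column
news_table %>% slice_head(n = 20) %>% kable()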

With this information in a table, I wanted to find out which news outlets featured the most ARLGP stories, so I could adjust my viewing habits if I wanted to catch more of these stories as they come out.

Graph 1

# Count stories per outlet and keep the five most-covered outlets
stories_by_outlet <- news_table %>% group_by(Outlet) %>% summarize(count = n()) %>%
  arrange(desc(count)) %>% top_n(5)
# Bar chart of story counts, ordered from most to least coverage
ggplot(data = stories_by_outlet, aes(x = fct_reorder(Outlet, count, .desc = TRUE), y = count)) +
  geom_bar(stat = "identity", fill = "red3") +
  xlab("Outlet") +
  ggtitle("News Outlets that Cover ARLGP Most")

WGME clearly covers the ARLGP most, based on the stories featured on the ARLGP's website.

Now that I knew which news outlet to watch for the most ARLGP stories, I wanted to know which days of the week would be best to tune in. I performed a similar grouping to see on which days of the week these stories aired most often.

Graph 2

# Count stories per day of the week and keep the top five days
stories_by_dayofweek <- news_table %>% group_by(Day_of_Week) %>% summarize(count = n()) %>%
  arrange(desc(count)) %>% top_n(5)
# Horizontal bar chart with the count labeled at the end of each bar
ggplot(data = stories_by_dayofweek, aes(x = fct_reorder(Day_of_Week, count, .desc = FALSE), y = count)) +
  geom_bar(stat = "identity", fill = "red3") +
  xlab("Day of Week") +
  geom_text(aes(label = count), hjust = "left") +
  coord_flip() +
  ggtitle("ARLGP Coverage by Day of Week")

Tuesday and Saturday appear to be the best days to tune in, followed by Wednesday and Thursday.

Discussion and Issues

I found this exercise very interesting, as it allowed me to pull data from a webpage and present it in a more easily digestible form. I could have gone through and manually counted the number of articles from each day of the week, or tallied which news outlets appeared most often, but that would have been very time-consuming and left room for error.

However, while a lot of information was gleaned from this exercise, the results are incomplete in several ways.

Several issues affected the final graphs, so the conclusions above are almost certainly not 100% accurate. A few of the issues were as follows: