Assignment 6

Animal Refuge League of Greater Portland (ARLGP) News Stories

I used web scraping to find out more about the news outlets and stories that the ARLGP features on its website (under the Press section of the News & Events tab).

I chose this topic because my background is in Animal Science and I was curious which outlets the ARLGP partnered with most on news stories. With ethical data scraping in mind, I considered this acceptable for a few reasons:

It scrapes historical data rather than current information
It does not promote an alternative service or take away from ARLGP’s ability to operate normally
It may even raise awareness of the ARLGP
It is unlikely to disrupt the web servers, given the type and quantity of information being scraped

Data Scraped

I collected two sets of values for my purposes, listed in the following code chunk:

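# Link text of each story (outlet and title together); arlgp_news is the already parsed Press page
news_title <- arlgp_news %>% html_nodes("h4+ p a") %>% html_text()
# Day-of-week and date label, taken from the <strong> element in the same paragraph
dayofweek <- arlgp_news %>% html_nodes("h4+ p > strong") %>% html_text()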

The first data set contained the hyperlinked articles on the web page, which included both the news outlet that ran the story and the title of the story itself. This was separated into two columns, one for the News Outlet, and one for the title of the article.

# Split "Outlet: Title" on ": " into two columns, then drop rows containing NA
news_story <- colsplit(news_title, ": ", names = c("Outlet", "Story_Name")) %>% drop_na()
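
Some story titles contain a colon of their own, so the split has to happen only at the first ": ". As a rough illustration of that idea, here is a base R sketch. The helper name split_first is hypothetical, the first headline is reconstructed from a row of the table further down, and the second string is invented to show what happens when no delimiter is present.

```r
# Hypothetical helper: split each string at the FIRST occurrence of `sep` only
split_first <- function(x, sep = ": ") {
  # Position of the first match, or -1 when the delimiter is absent
  pos <- as.integer(regexpr(sep, x, fixed = TRUE))
  found <- pos > 0
  data.frame(
    Outlet = ifelse(found, substr(x, 1, pos - 1), NA_character_),
    Story_Name = ifelse(found, substr(x, pos + nchar(sep), nchar(x)), NA_character_),
    stringsAsFactors = FALSE
  )
}

# Assumed inputs: one reconstructed headline and one string with no delimiter
headlines <- c(
  "NECN: We loved him right from day one: Maine family adopts dog after fostering",
  "no delimiter here"
)
res <- split_first(headlines)
print(res)
```

Rows without a delimiter come back as NA in both columns, so they can be removed with an ordinary NA filter.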

The other data set was the day of the week and the date in mm/dd/yy format (for example, 9/19/20). As with the first data set, I wanted to separate the day of the week from the date to make it easier to navigate.

# Remove colons, then split "Day, date" at the comma into two columns
dayofweek <- dayofweek %>% str_replace_all(":", "")
day_and_time <- colsplit(dayofweek, ", ", names = c("Day_of_Week", "Date"))
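
Once the date sits in its own column, it can be parsed into a real Date and used to cross-check the scraped day of the week. The sketch below is a minimal base R illustration, not part of the original analysis. It uses three rows copied from the table further down, and the column names Parsed, Computed_Day, and Matches are made up for this example.

```r
# Three sample rows copied from the table of scraped stories
scraped <- data.frame(
  Day_of_Week = c("Saturday", "Thursday", "Friday"),
  Date = c("9/19/20", "7/23/20", "4/24/20"),
  stringsAsFactors = FALSE
)

# Parse month/day/two-digit-year strings into Date objects
scraped$Parsed <- as.Date(scraped$Date, format = "%m/%d/%y")

# POSIXlt stores the weekday as 0 (Sunday) to 6 (Saturday), independent of locale
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
scraped$Computed_Day <- day_names[as.POSIXlt(scraped$Parsed)$wday + 1]

# TRUE where the scraped label agrees with the calendar
scraped$Matches <- scraped$Day_of_Week == scraped$Computed_Day
print(scraped)
```

A row where Matches is FALSE or NA would point to a label that was split or scraped incorrectly.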

Finally, I combined the two data sets into one final table that provided me with the day of the week, date, news outlet, and title for each news story posted to the page. This is illustrated in the table below.

Table: ARLGP News Features

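# Combine the two data frames side by side and drop the rows that split incorrectly
news_table <- cbind(day_and_time, news_story) %>% slice(-c(62:63, 129:132, 136:137, 142, 159, 163:165, 169, 171, 178, 180, 190))
# With no weight column given, top_n() ranks by the last column (Story_Name)
news_table %>% top_n(20) %>% kable()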

## Selecting by Story_Name

Day_of_Week	Date	Outlet	Story_Name
Saturday	9/19/20	News Center Maine	Young Mainer bouncing for 24 hours to raise money for the ARLGP
Thursday	7/23/20	WGME	The importance of foster families during the pandemic
Thursday	5/14/20	Maine Public	Veterinary news
Friday	4/24/20	Portland Press Herald	To lift spirits, ARL holds ‘hootenanny’ drive-by parade
Wednesday	2/26/20	News Center Maine	The many benefits of puppy yoga
Monday	2/3/20	News Center Maine	The sweet sounds of a violin make it feel more like home
Sunday	2/2/20	News Center Maine	The winner of Puppy Bowl XVI is Team Fluff with the help of Maine Collie
Tuesday	12/24/19	WMTW	Westbrook house shows off lights for good cause
Monday	10/14/19	WGME	Terminally ill dog adopted, ready for ‘bucket list’ adventures
Wednesday	8/21/19	WMTW	Some of 100 animals seized from Maine home ready for adoption
Monday	6/3/19	American Journal	Together Days revisits 1980s
Tuesday	4/23/19	Forbes	The Maine Thing: How This Inn by the Sea Can Seduce You
Wednesday	3/13/19	The Forecaster	USM’s ‘Applause for Paws’ concert inspired by rescue animals
Saturday	8/25/18	NECN	We loved him right from day one: Maine family adopts dog after fostering
Sunday	7/8/18	News Center Maine	Tips to keep your pet from running away during fireworks
Tuesday	7/3/18	Kennebec Journal	Westbrook neighborhood’s wandering cat Simba, a high school regular, dies
Friday	2/2/18	American Journal	Westbrook puppy touches down on national stage
Friday	10/13/17	WGME	Traveling dog wash boosts local animal shelter
Thursday	6/15/17	Keep Me Current	TD Affinity Membership Program
Thursday	4/6/17	Keep Me Current	Tees get A+ as animal shelter fundraiser

With this information now in a table, I wanted to find out which news outlets featured the most ARLGP stories, so I could adjust my viewing habits if I wanted to see more of these stories as they come out.

Graph 1

# Count stories per outlet and keep the five outlets with the highest counts
stories_by_outlet <- news_table %>% group_by(Outlet) %>% summarize(count = n()) %>% arrange(desc(count)) %>%
  top_n(5, count)
# Bar chart with outlets ordered from most to fewest stories
ggplot(data = stories_by_outlet, aes(x = fct_reorder(Outlet, count, .desc = TRUE), y = count)) +
  geom_bar(stat = "identity", fill = "red3") +
  xlab("Outlet") +
  ggtitle("News Outlets that Cover ARLGP Most")

Based on the stories featured on the ARLGP website, WGME clearly covers the ARLGP most.

Now that I knew which news outlet to watch if I wanted the most stories featuring the ARLGP, I wanted to know which days of the week would be best to tune in. I performed a similar grouping to see on which days of the week these stories aired most often.

Graph 2

# Count stories per day of the week and keep the five highest counts
stories_by_dayofweek <- news_table %>% group_by(Day_of_Week) %>% summarize(count = n()) %>% arrange(desc(count)) %>% top_n(5, count)
# Horizontal bar chart with the count printed at the end of each bar
ggplot(data = stories_by_dayofweek, aes(x = fct_reorder(Day_of_Week, count, .desc = FALSE), y = count)) +
  geom_bar(stat = "identity", fill = "red3") +
  xlab("Day of Week") +
  geom_text(aes(label = count), hjust = "left") +
  coord_flip() +
  ggtitle("ARLGP Coverage by Day of Week")

Tuesday and Saturday appear to be the best days to tune in, followed by Wednesday and Thursday.

Discussion and Issues

I found this exercise to be very interesting as it allowed me to pull data from a webpage and present it in a more easily digestible manner. I could have gone through and manually counted the number of articles from each day of the week, or made tallies of what news outlets appeared the most, but this would have been very time consuming and left room for errors.

However, while a lot of information was gleaned from this, it is in many ways incomplete.

Several issues affected the final graphs, so these conclusions are almost certainly not fully accurate. A few of the issues were as follows:

It was very difficult to merge the News Story and Day/Date tables because they had different numbers of rows. After spending a lot of time trying to diagnose the issue, I discovered that the two extra rows in the Day/Date table appeared because the selector chosen with the SelectorGadget tool also matched stray spaces on the page. Removing these rows resolved the discrepancy.
When I ran colsplit(), some of the data split in unexpected ways in both data frames. After merging, I had to go through and manually slice out rows with incomplete or misformatted data, because some rows held empty spaces rather than missing values, which made them hard to remove with a function designed to drop NA values. This reduced the data set from 190 rows to 172.
When I first created the Day of Week chart, I got a list of overlapping days (two values for Thursday, for example). This is likely a result of the same column splits that caused the issues in the previous bullet. For the purposes of this assignment, I kept only the top 5 results to get the chart shown. This is why the total count is 91, less than the 172 articles considered.
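
The last two issues might be handled in code rather than by slicing row numbers by hand. The base R sketch below is one possible approach, not what was done for this assignment. The rows in raw are invented to imitate the problems described above, and clean_column is a hypothetical helper. It trims stray whitespace, turns blank strings into NA so that ordinary NA handling can remove them, and then counts stories per day.

```r
# Invented rows imitating the problems above: a trailing space and blank cells
raw <- data.frame(
  Day_of_Week = c("Thursday", "Thursday ", " ", "Tuesday", ""),
  Outlet = c("WGME", "WMTW", "", "Forbes", "WGME"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: trim whitespace, then convert empty strings to NA
clean_column <- function(x) {
  x <- trimws(x)
  x[!is.na(x) & x == ""] <- NA
  x
}

cleaned <- as.data.frame(lapply(raw, clean_column), stringsAsFactors = FALSE)

# Keep only rows where every column has a value
complete <- cleaned[complete.cases(cleaned), , drop = FALSE]

# "Thursday" and "Thursday " now fall into the same group
day_counts <- table(complete$Day_of_Week)
print(day_counts)
```

By default trimws() removes only spaces, tabs, and line breaks. If the stray characters on the page turn out to be something else, such as non-breaking spaces, its whitespace argument would need to be widened.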