I used web scraping to find out more about the news outlets and stories that the ARLGP has featured on its website (under the Press section of the News & Events tab).
I chose this as my background is in Animal Science and I was curious who the ARLGP partnered with most on news stories. Keeping in mind ethical data scraping, I thought this was acceptable for a few reasons:
I collected two sets of values for my purposes, listed in the following code chunk:
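library(rvest)
# arlgp_news is assumed to be the Press page, already parsed (for example with read_html())
news_title <- arlgp_news %>% html_nodes("h4+ p a") %>% html_text()
dayofweek <- arlgp_news %>% html_nodes("h4+ p > strong") %>% html_text()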
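To make the two selectors easier to follow, here is a minimal sketch that runs them against a small hand-written HTML fragment. The fragment is hypothetical and is not the ARLGP page's real markup; it only assumes a heading followed by a paragraph that holds a bold date and a link.

```r
library(rvest)

# Hypothetical fragment: the real ARLGP markup may be structured differently.
page <- minimal_html('
  <h4>2020</h4>
  <p><strong>Saturday, 9/19/20:</strong>
     <a href="#">News Center Maine: Young Mainer bouncing for 24 hours</a></p>
  <p><strong>Thursday, 7/23/20:</strong>
     <a href="#">WGME: Foster families</a></p>
')

# "h4+ p a": links inside a <p> that immediately follows an <h4>
titles <- page %>% html_nodes("h4+ p a") %>% html_text()

# "h4+ p > strong": bold text that is a direct child of that same <p>
dates <- page %>% html_nodes("h4+ p > strong") %>% html_text()
```

In this fragment only the first paragraph sits directly after the heading, so each selector returns a single value. The second paragraph is skipped because `+` matches only the immediately following sibling.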
The first data set contained the hyperlinked articles on the web page. Each link's text included both the news outlet that ran the story and the story's title, so I split it on ": " into two columns: one for the outlet and one for the title.
news_story <- colsplit(news_title, ": ", names = c("Outlet", "Story_Name")) %>% drop_na()
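As an alternative sketch of the same split, tidyr's separate() makes the handling of edge cases explicit. This is not the method used above. The data frame below is a small stand-in: the second title is shortened from the table in this post, and the third string is invented to show what happens when no outlet is present.

```r
library(tidyr)

# Stand-in strings in the "Outlet: Story" shape described above.
links <- data.frame(
  news_title = c(
    "WGME: Traveling dog wash boosts local animal shelter",
    "NECN: We loved him right from day one: Maine family adopts dog",
    "No outlet given"
  ),
  stringsAsFactors = FALSE
)

split_links <- separate(
  links, news_title,
  into = c("Outlet", "Story_Name"),
  sep = ": ",
  extra = "merge",  # split only at the first ": " and keep later ones in Story_Name
  fill = "right"    # a string with no ": " gets NA in Story_Name
)
```

With these settings a title that itself contains ": " stays in one piece, and rows without a separator end up with a true NA that drop_na() can remove.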
The other data set was the day of the week and the date in m/d/yy format (for example, 9/19/20). As with the first data set, I wanted to separate the day of the week from the date to make the table easier to navigate.
dayofweek <- dayofweek %>% str_replace_all("\\:", "")
day_and_time <- colsplit(dayofweek, "\\, ", names = c("Day_of_Week", "Date"))
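Once the day and date sit in separate columns, one way to check that the split worked is to parse the date and recompute the weekday from it. The sketch below uses base R only. The three rows are copied from the final table in this post, and the column names Parsed_Date and Day_Matches are my own additions.

```r
# Three rows copied from the final table in this post.
day_and_time <- data.frame(
  Day_of_Week = c("Saturday", "Thursday", "Tuesday"),
  Date = c("9/19/20", "7/23/20", "12/24/19"),
  stringsAsFactors = FALSE
)

# Parse m/d/yy strings into Date objects.
parsed <- as.Date(day_and_time$Date, format = "%m/%d/%y")

# as.POSIXlt()$wday is 0 for Sunday through 6 for Saturday,
# so the lookup does not depend on the system locale.
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
computed_day <- day_names[as.POSIXlt(parsed)$wday + 1]

day_and_time$Parsed_Date <- parsed
day_and_time$Day_Matches <- computed_day == day_and_time$Day_of_Week
```

Any row where Day_Matches is FALSE or NA would point to a value that split badly or a date that failed to parse.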
Finally, I combined the two data sets into one table that gave me the day of the week, date, news outlet, and title for each news story posted to the page. A sample of this table is shown below.
news_table <- cbind(day_and_time, news_story) %>% slice(-c(62:63, 129:132, 136:137,142, 159, 163:165, 169, 171, 178, 180, 190))
news_table %>% top_n(20) %>% kable()
## Selecting by Story_Name
Day_of_Week | Date | Outlet | Story_Name
---|---|---|---|
Saturday | 9/19/20 | News Center Maine | Young Mainer bouncing for 24 hours to raise money for the ARLGP |
Thursday | 7/23/20 | WGME | The importance of foster families during the pandemic |
Thursday | 5/14/20 | Maine Public | Veterinary news |
Friday | 4/24/20 | Portland Press Herald | To lift spirits, ARL holds ‘hootenanny’ drive-by parade |
Wednesday | 2/26/20 | News Center Maine | The many benefits of puppy yoga |
Monday | 2/3/20 | News Center Maine | The sweet sounds of a violin make it feel more like home |
Sunday | 2/2/20 | News Center Maine | The winner of Puppy Bowl XVI is Team Fluff with the help of Maine Collie |
Tuesday | 12/24/19 | WMTW | Westbrook house shows off lights for good cause |
Monday | 10/14/19 | WGME | Terminally ill dog adopted, ready for ‘bucket list’ adventures |
Wednesday | 8/21/19 | WMTW | Some of 100 animals seized from Maine home ready for adoption |
Monday | 6/3/19 | American Journal | Together Days revisits 1980s |
Tuesday | 4/23/19 | Forbes | The Maine Thing: How This Inn by the Sea Can Seduce You |
Wednesday | 3/13/19 | The Forecaster | USM’s ‘Applause for Paws’ concert inspired by rescue animals |
Saturday | 8/25/18 | NECN | We loved him right from day one: Maine family adopts dog after fostering |
Sunday | 7/8/18 | News Center Maine | Tips to keep your pet from running away during fireworks |
Tuesday | 7/3/18 | Kennebec Journal | Westbrook neighborhood’s wandering cat Simba, a high school regular, dies |
Friday | 2/2/18 | American Journal | Westbrook puppy touches down on national stage |
Friday | 10/13/17 | WGME | Traveling dog wash boosts local animal shelter |
Thursday | 6/15/17 | Keep Me Current | TD Affinity Membership Program |
Thursday | 4/6/17 | Keep Me Current | Tees get A+ as animal shelter fundraiser |
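
The "Selecting by Story_Name" message above is worth a note. When top_n() is called without a wt argument, it ranks rows by the last column instead of taking the first rows, which is why the sample is weighted toward titles late in the alphabet. The sketch below shows the difference on a toy data frame with invented titles. slice_head() is a dplyr function that was not used in the original code.

```r
library(dplyr)

# Toy data with invented titles, in "page order".
stories <- data.frame(
  Outlet = c("WGME", "WMTW", "Forbes", "NECN"),
  Story_Name = c("Bravo", "Zulu", "Alpha", "Yankee"),
  stringsAsFactors = FALSE
)

# First two rows exactly as they appear in the data frame.
first_two <- stories %>% slice_head(n = 2)

# top_n() with no wt ranks by the last column (Story_Name here),
# so it keeps the two titles that sort last alphabetically.
by_title <- stories %>% top_n(2)
```

If the goal is simply a preview of the table, slice_head(n = 20) or head(20) is probably the closer fit.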
With this information in a table, I wanted to find out which news outlets featured the most ARLGP stories, so I could adjust my viewing habits if I wanted to see more of these stories as they come out.
stories_by_outlet <- news_table %>% group_by(Outlet) %>% summarize(count = n()) %>% arrange(desc(count)) %>%
top_n(5)
ggplot(data = stories_by_outlet, aes(x = fct_reorder(Outlet, count, .desc = TRUE), y = count)) +
geom_bar(stat = "identity", fill = "red3") +
xlab("Outlet") +
ggtitle("News Outlets that Cover ARLGP Most")
Based on the stories featured on the ARLGP's website, WGME covers the ARLGP most often.
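One detail of the ranking step: top_n() keeps ties, so asking for the top 5 can return more than 5 rows when outlets share a count. The sketch below uses invented counts, not the scraped data. It shows the same summary written with count(), plus slice_max() with and without ties. Both are dplyr functions that were not used in the original code.

```r
library(dplyr)

# Invented rows, one per story, standing in for news_table.
toy_news <- data.frame(
  Outlet = c("WGME", "WGME", "WGME", "WMTW", "WMTW", "NECN", "Forbes"),
  stringsAsFactors = FALSE
)

# count() combines group_by(), summarize(n()), and arrange(desc()).
outlet_counts <- toy_news %>% count(Outlet, name = "count", sort = TRUE)

# Asking for the top 3: NECN and Forbes tie for third place.
with_ties <- outlet_counts %>% slice_max(count, n = 3)                     # 4 rows
no_ties <- outlet_counts %>% slice_max(count, n = 3, with_ties = FALSE)    # 3 rows
```

Whether to keep ties is a judgment call. Dropping them gives a fixed number of bars, but the choice of which tied outlet survives is arbitrary.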
Now that I knew which news outlet to watch for the most stories featuring the ARLGP, I wanted to know which days of the week would be best to tune in. I ran a similar summary to see on which days of the week these stories aired most often.
stories_by_dayofweek <- news_table %>% group_by(Day_of_Week) %>% summarize(count = n()) %>% arrange(desc(count)) %>% top_n(5)
ggplot(data = stories_by_dayofweek, aes(x = fct_reorder(Day_of_Week, count, .desc = FALSE), y = count)) +
geom_bar(stat = "identity", fill = "red3") +
xlab("Day of Week") +
geom_text(aes(label = count, hjust = "left")) +
coord_flip() +
ggtitle("ARLGP Coverage by Day of Week")
Tuesday and Saturday appear to be the best days to tune in, followed by Wednesday and Thursday.
I found this exercise very interesting, as it allowed me to pull data from a webpage and present it in a more easily digestible form. I could have manually counted the number of articles for each day of the week, or tallied which news outlets appeared most often, but this would have been very time-consuming and left room for error.
However, while a lot of information was gleaned from this, it is in many ways incomplete.
There were several areas that caused issues and altered the end result with the graphs, so these conclusions are almost certainly not 100% accurate. A few of the issues were as follows:
- It was very difficult to merge the News Story and Day/Date tables because they had different numbers of rows. After spending a lot of time diagnosing the issue, I discovered that the two extra rows in the Day/Date table came from stray spaces that the SelectorGadget selector had also matched. Removing them resolved the discrepancy.
- When I ran colsplit(), some values in both data frames split in unexpected ways. After merging, I had to manually slice out rows with incomplete or misformatted data, because some rows contained empty strings or spaces rather than NA, which a function designed to remove NA values does not catch. This reduced the data set from 190 rows to 172.
- When I first created the Day of Week chart, the grouped results contained duplicate days (two entries for Thursday, for example). This is likely a result of the same column-split problems described in the previous bullet. For this assignment, I kept only the top 5 results to produce the chart shown, which is why the counts in the chart total 91 rather than the 172 articles considered.
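
The last two issues might be handled in code rather than with hand-picked row numbers. The sketch below is only a possibility and rests on an assumption I have not verified: that the duplicate days and the rows drop_na() missed were caused by stray whitespace and empty strings. The data frame is invented for illustration.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Invented rows with the kinds of problems described above:
# a leading space, a blank-only cell, and an empty string.
raw <- data.frame(
  Day_of_Week = c("Thursday", " Thursday", "Tuesday", "", "Saturday"),
  Outlet = c("WGME", "WMTW", " ", "NECN", "WGME"),
  stringsAsFactors = FALSE
)

cleaned <- raw %>%
  mutate(across(everything(), str_squish)) %>%        # trim and collapse whitespace
  mutate(across(everything(), ~ na_if(.x, ""))) %>%   # empty strings become real NA
  drop_na()                                           # now drop_na() can remove them

# "Thursday" and " Thursday" collapse into a single group.
day_counts <- cleaned %>% count(Day_of_Week, name = "count", sort = TRUE)
```

If the assumption holds for the real data, this would keep all seven days in the chart instead of only the top 5 groups.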