The website I decided to scrape displays quotes from various famous authors and gives a brief biography of each author. The URL for the website is https://quotes.toscrape.com/page/1/. I want to look for trends in these quotes, such as which authors appear most often, which tags are most common, and how long the quotes are.
First, I scrape the data from the website; that step is done in an R script submitted separately. I also load the packages for data analysis and visualization here.
library(readr)
library(purrr)
library(rvest)
##
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
##
## guess_encoding
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(httr)
library(stringr)
library(ggplot2)
library(tidyr)
These are all of the packages needed to scrape the data and to analyze it in this document.
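The scraping script itself is submitted separately, so the following is only a minimal sketch of how such a script might look with rvest. The function names `parse_quote_page()` and `scrape_quotes()` are hypothetical, and the CSS classes (`.quote`, `.text`, `.author`, `.tag`) and the page count of 10 are assumptions about the site, not details taken from the original script.

```r
library(rvest)
library(dplyr)

# Turn one parsed page into a tibble with one row per quote.
# The CSS classes used here are assumptions about the site's markup.
parse_quote_page <- function(page) {
  quote_nodes <- html_elements(page, ".quote")
  tibble(
    quote        = html_text2(html_element(quote_nodes, ".text")),
    quote_author = html_text2(html_element(quote_nodes, ".author")),
    # Collapse each quote's tags into one comma-separated string
    quote_tags   = vapply(
      quote_nodes,
      function(node) paste(html_text2(html_elements(node, ".tag")), collapse = ", "),
      character(1)
    )
  )
}

# Download pages 1..n_pages and stack the results (needs network access).
scrape_quotes <- function(n_pages = 10) {
  urls <- sprintf("https://quotes.toscrape.com/page/%d/", seq_len(n_pages))
  pages <- lapply(urls, function(url) {
    Sys.sleep(1)  # short pause between requests
    parse_quote_page(read_html(url))
  })
  bind_rows(pages)
}

# Example use (commented out because it hits the live site):
# quote_data <- scrape_quotes(10)
# readr::write_csv(quote_data, "quote_data.csv")
```

Keeping the parsing step separate from the download step means the parser can be checked against a small HTML snippet without touching the website.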
The data also needs to be loaded, which can be done through this link to the file I uploaded to Microsoft OneDrive:
quote_data <-
read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/estepa1_xavier_edu/ERyHHgRSC3pHoVPD6-mYsEABh3cvlTCnpKpIA4b2VsL3gw?download=1")
## Rows: 100 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): quote, quote_author, quote_tags
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
quote_data %>%
group_by(quote_author) %>%
summarise(count = n()) %>%
# Rank by count explicitly; ties at the cutoff are kept, as with top_n()
slice_max(count, n = 10) %>%
ggplot(aes(x = reorder(quote_author, count), y = count)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(title = "Top 10 Most Frequently Quoted Authors", x = "Author", y = "Number of Quotes")
This plot shows the most frequently quoted authors among the 100 quotes in the data. The most common author is Albert Einstein, with J.K. Rowling a close second. This may reflect their cultural influence in recent history, although it also depends on which quotes the website chose to include.
tag_data <- quote_data %>%
separate_rows(quote_tags, sep = ",\\s*") %>%
group_by(quote_tags) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
slice_head(n = 10)
ggplot(tag_data, aes(x = reorder(quote_tags, count), y = count)) +
geom_col(fill = 'SteelBlue') +
coord_flip() +
labs(title = "Frequency of Tags", x = "Tags", y = "Frequency") +
theme_minimal()
This plot shows the 10 most used tags in the quotes. Love and life are among the most common, which suggests that many of these quotes have a positive sentiment, though tags alone cannot confirm that. It would be interesting to conduct a sentiment analysis on these quotes in the future.
quote_data$quote_length <- nchar(as.character(quote_data$quote))
ggplot(quote_data, aes(x = quote_length)) +
geom_histogram(bins = 30, fill = "lightblue", color = "black") +
labs(title = "Distribution of Quote Lengths", x = "Length of Quote (characters)", y = "Frequency")
The data from this website is limited, so I created a fourth variable, quote_length, that counts the total number of characters in each quote. Based on the histogram, the most common quote length is around 175 characters. This may reflect a preference for quotes short enough to keep people’s attention. It is also a similar length to an old tweet, which could be a coincidence.
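A histogram gives only a visual impression, so a numeric summary can help confirm the typical length. The sketch below assumes a data frame with a `quote` column like the one in quote_data; the helper name `summarise_lengths()` and the three toy quotes are made up for illustration and are not from the scraped data.

```r
library(dplyr)

# Summarise quote lengths (in characters) for a data frame with a `quote` column.
summarise_lengths <- function(data) {
  data %>%
    mutate(quote_length = nchar(quote)) %>%
    summarise(
      shortest = min(quote_length),
      median   = median(quote_length),
      longest  = max(quote_length)
    )
}

# Toy stand-in for quote_data; these quotes are invented for the example.
toy_quotes <- tibble(
  quote = c("Short one.", "A somewhat longer made-up quote.", "Mid-length text here.")
)

length_summary <- summarise_lengths(toy_quotes)
length_summary
```

Running `summarise_lengths(quote_data)` on the real data would show whether the median sits near the peak seen in the histogram.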
top_authors <- quote_data %>%
group_by(quote_author) %>%
summarise(total_quotes = n()) %>%
arrange(desc(total_quotes)) %>%
slice_head(n = 10) %>%
pull(quote_author)
filtered_data <- quote_data %>%
filter(quote_author %in% top_authors)
ggplot(filtered_data, aes(x = quote_author, y = quote_length)) +
geom_boxplot(fill = "lightblue") +
coord_flip() + # Flip coordinates for better readability
labs(title = "Quote Length Variation by Top 10 Most Quoted Authors",
x = "Author", y = "Quote Length (characters)") +
theme_minimal()
Since we already looked at the top 10 authors, I wanted to see how quote length varies among them, which the box plot above shows. Bob Marley has the most variation in quote length, with some quotes being short and others relatively long. Dr. Seuss has the narrowest spread, which could be due to his audience.
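The spread read off a box plot can also be checked numerically. This is a rough sketch, assuming a data frame with `quote_author` and `quote_length` columns as in quote_data; the helper `spread_by_author()` and the toy authors and lengths are hypothetical.

```r
library(dplyr)

# Per-author range and interquartile range of quote length, widest spread first.
spread_by_author <- function(data) {
  data %>%
    group_by(quote_author) %>%
    summarise(
      n_quotes   = n(),
      min_length = min(quote_length),
      max_length = max(quote_length),
      iqr_length = IQR(quote_length),
      .groups = "drop"
    ) %>%
    arrange(desc(iqr_length))
}

# Toy data: "Author A" varies a lot, "Author B" barely at all.
toy_lengths <- tibble(
  quote_author = c("Author A", "Author A", "Author A", "Author B", "Author B", "Author B"),
  quote_length = c(40, 120, 300, 90, 100, 110)
)

author_spread <- spread_by_author(toy_lengths)
author_spread
```

Applied to `filtered_data`, the first and last rows would identify the authors with the widest and narrowest spread. With only a few quotes per author, these statistics should be read cautiously.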
quote_data <- quote_data %>%
mutate(quote_word_count = sapply(strsplit(as.character(quote), "\\s+"), length))
ggplot(quote_data, aes(x = quote_word_count, y = quote_length)) +
geom_point(alpha = 0.5, color = "blue") +
geom_smooth(method = "lm", color = "red") +
labs(title = "Relationship Between Word Count and Quote Length", x = "Word Count", y = "Quote Length (characters)")
## `geom_smooth()` using formula = 'y ~ x'
Finally, I wanted to see whether any quotes use unusually short or long words. As mentioned for the previous plot, authors with shorter quotes, such as Dr. Seuss, might also use shorter words. To check this, I added a fifth variable, quote_word_count, to meet the requirement, and created a scatter plot of quote length against word count. The points stay close to the fitted line, so there is little deviation from the overall trend, which suggests that the authors use words of similar average length.
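One way to put a number on this is to compute the average characters per word for each quote and the correlation between word count and length. The sketch below assumes a `quote` column as in quote_data; `add_word_stats()` and the toy quotes are hypothetical, and `chars_per_word` counts spaces and punctuation, so it slightly overstates true word length.

```r
library(dplyr)

# Add length, word count, and characters per word for each quote.
add_word_stats <- function(data) {
  data %>%
    mutate(
      quote_length     = nchar(quote),
      quote_word_count = lengths(strsplit(quote, "\\s+")),
      # Includes spaces and punctuation, so this is only a rough proxy
      chars_per_word   = quote_length / quote_word_count
    )
}

# Toy quotes invented for the example.
toy_words <- tibble(
  quote = c("one two three", "a bb", "longer words everywhere", "tiny")
)

word_stats <- add_word_stats(toy_words)
word_stats

# A correlation near 1 would mean length is almost fully explained by word count.
cor(word_stats$quote_word_count, word_stats$quote_length)
```

On the real data, a narrow range of `chars_per_word` across authors would support the reading that word length is similar throughout.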
While it may seem fairly simple, quotes can be relatively interesting when examined like this. At a larger scale, we could see trends in culture, attention, influential figures, and more. As more data is scraped on these quotes, it will be interesting to see whether more trends emerge. This kind of analysis is useful for applications like text analysis, educational tools, or studies where the structure and delivery of quotes are important. The dataset is a useful source for exploring language through the lens of quotes, and it offers some insight into communication and the impact of words in human culture.