library(tidyverse)
library(rvest)
This R Markdown document is my submission for Task 1 of the British Airways data analysis exercise. Task 1 focuses on scraping customer reviews from a third-party source and analyzing them to present the insights I uncover.
I will use sentiment analysis to analyze the customer reviews and summarize the findings in a one-slide deck as the final product.
First, I set up the parameters for scraping Skytrax customer reviews. I will scrape 3,000 reviews from the website: 30 pages of 100 reviews each.
base_url <- "https://www.airlinequality.com/airline-reviews/british-airways/page/"
pages <- 30
page_size <- 100
# Preallocate character vectors, one slot per expected review
review_content <- rep(NA_character_, pages * page_size)
review_title <- rep(NA_character_, pages * page_size)
Using the rvest package, I scrape the 3,000 most recent reviews of British Airways. I also check whether any scraped review is missing, to ensure data hygiene.
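The chunk that produced the log below is not shown in this document. A minimal sketch of such a loop might look like the following. The query string and the CSS selectors (`div.text_content`, `h2.text_header`) are assumptions about the Skytrax page layout, not taken from the original code, and `parse_reviews()`, `scrape_reviews()`, and `pct_missing()` are hypothetical helper names.

```r
library(rvest)

# Hypothetical helper: pull review titles and bodies out of one parsed page.
# The selectors are assumptions about the site's markup.
parse_reviews <- function(page) {
  list(
    title   = html_text2(html_elements(page, "h2.text_header")),
    content = html_text2(html_elements(page, "div.text_content"))
  )
}

# Hypothetical helper: loop over result pages and fill preallocated vectors.
scrape_reviews <- function(base_url, pages, page_size) {
  review_title   <- rep(NA_character_, pages * page_size)
  review_content <- rep(NA_character_, pages * page_size)
  for (i in seq_len(pages)) {
    # Assumed query string: newest reviews first, page_size reviews per page
    url <- paste0(base_url, i, "/?sortby=post_date%3ADesc&pagesize=", page_size)
    parsed <- parse_reviews(read_html(url))
    n   <- min(length(parsed$content), page_size)
    idx <- (i - 1) * page_size + seq_len(n)
    review_title[idx]   <- parsed$title[seq_len(n)]
    review_content[idx] <- parsed$content[seq_len(n)]
    print(paste("Finish reading page", i, "currently with",
                (i - 1) * page_size + n, "reviews"))
    Sys.sleep(1)  # pause between requests to be polite to the server
  }
  list(title = review_title, content = review_content)
}

# Hypothetical helper: percentage of missing entries in a vector
pct_missing <- function(x) round(mean(is.na(x)) * 100, 2)

# Offline check of the parser on a tiny hand-written page (no network needed)
demo_page <- minimal_html('
  <h2 class="text_header">Great crew</h2>
  <div class="text_content">Trip Verified | Friendly crew and on time.</div>
')
demo <- parse_reviews(demo_page)
```

Calling `scrape_reviews(base_url, pages, page_size)` would then perform the live requests, and `pct_missing()` applied to the returned vectors would give the share of missing reviews.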
## [1] "Finish reading page 1 currently with 100 reviews"
## [1] "Finish reading page 2 currently with 200 reviews"
## [1] "Finish reading page 3 currently with 300 reviews"
## [1] "Finish reading page 4 currently with 400 reviews"
## [1] "Finish reading page 5 currently with 500 reviews"
## [1] "Finish reading page 6 currently with 600 reviews"
## [1] "Finish reading page 7 currently with 700 reviews"
## [1] "Finish reading page 8 currently with 800 reviews"
## [1] "Finish reading page 9 currently with 900 reviews"
## [1] "Finish reading page 10 currently with 1000 reviews"
## [1] "Finish reading page 11 currently with 1100 reviews"
## [1] "Finish reading page 12 currently with 1200 reviews"
## [1] "Finish reading page 13 currently with 1300 reviews"
## [1] "Finish reading page 14 currently with 1400 reviews"
## [1] "Finish reading page 15 currently with 1500 reviews"
## [1] "Finish reading page 16 currently with 1600 reviews"
## [1] "Finish reading page 17 currently with 1700 reviews"
## [1] "Finish reading page 18 currently with 1800 reviews"
## [1] "Finish reading page 19 currently with 1900 reviews"
## [1] "Finish reading page 20 currently with 2000 reviews"
## [1] "Finish reading page 21 currently with 2100 reviews"
## [1] "Finish reading page 22 currently with 2200 reviews"
## [1] "Finish reading page 23 currently with 2300 reviews"
## [1] "Finish reading page 24 currently with 2400 reviews"
## [1] "Finish reading page 25 currently with 2500 reviews"
## [1] "Finish reading page 26 currently with 2600 reviews"
## [1] "Finish reading page 27 currently with 2700 reviews"
## [1] "Finish reading page 28 currently with 2800 reviews"
## [1] "Finish reading page 29 currently with 2900 reviews"
## [1] "Finish reading page 30 currently with 3000 reviews"
## [1] "Missing review text content is 0 % of data"
## [1] "Missing review text content is 0 % of data"
Once the data is scraped, I clean it and store it in a tidy format with one review per row. I strip unnecessary information such as emoji from the text and parse the “Trip Verified” prefix into a separate column for later analysis.
df <- tibble(
  id = seq_along(review_content),
  review_title = review_title,
  review_content = review_content
) %>%
  mutate(
    # Flag reviews that carry the "Trip Verified" prefix
    is_verified_trip = str_detect(review_content, 'Trip Verified'),
    # Strip the emoji and "Trip Verified | " prefix from the review text
    review_text = str_remove(
      string = review_content,
      pattern = regex('[:emoji:] trip verified \\| ', ignore_case = TRUE)
    )
  )
glimpse(df)
## Rows: 3,000
## Columns: 5
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ review_title <chr> "Lots of cancellations and delays", "Overall, very ha…
## $ review_content <chr> "✅ Trip Verified | Lots of cancellations and delays …
## $ is_verified_trip <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ review_text <chr> "Lots of cancellations and delays and no one apologiz…
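As a quick check of the prefix handling, the sketch below applies a similar pattern to a few made-up strings. It assumes that unverified reviews may carry a `Not Verified |` prefix (an assumption about the source data, not something shown above), which the `trip verified` pattern would leave in place. `strip_prefix()` is a hypothetical helper.

```r
library(stringr)

# Made-up examples; "\u2705" is the check-mark emoji seen in the scraped text
samples <- c(
  "\u2705 Trip Verified | Lots of cancellations and delays.",
  "Not Verified | Seats were comfortable.",
  "Crew were helpful throughout."
)

# Hypothetical helper: drop any leading non-letters, then either prefix
strip_prefix <- function(x) {
  str_remove(
    x,
    regex("^[^[:alpha:]]*(trip verified|not verified)\\s*\\|\\s*",
          ignore_case = TRUE)
  )
}

cleaned  <- strip_prefix(samples)
verified <- str_detect(
  samples,
  regex("^[^[:alpha:]]*trip verified", ignore_case = TRUE)
)
```

Strings without either prefix pass through unchanged, so the helper is safe to apply to every row.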
Now that the data is in a tidy format, I conduct basic EDA to look at the share of reviews from verified versus non-verified trips.
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
# Share of reviews that come from verified trips
df %>%
  group_by(is_verified_trip) %>%
  summarise(n = n()) %>%
  mutate(pct = n / sum(n)) %>%
  ggplot(aes(x = is_verified_trip, y = pct)) +
  geom_col(aes(fill = is_verified_trip)) +
  scale_y_continuous(labels = scales::percent) +
  geom_label(aes(label = paste0(round(pct * 100), '%'), y = pct + 0.05)) +
  labs(x = 'Is Trip Verified', y = '% of reviews', title = 'Share of reviews from verified trips')
I also looked at the top words used in reviews to get a high-level view of
what customers usually talk about when they review British Airways.
library(tidytext)
library(textdata)
library(SnowballC)
tokenized_df <- df %>%
  unnest_tokens(input = 'review_text', output = 'word') %>%
  anti_join(stop_words) %>%
  filter(!str_detect(word, "([:digit:]|[:punct:])"))
## Joining, by = "word"
# Stem words so that inflected forms are counted together
tokenized_df_stemmed <- tokenized_df %>%
  mutate(word_stemmed = wordStem(word))
# Top 20 stemmed words by share of all tokens
tokenized_df_stemmed %>%
  count(word_stemmed) %>%
  mutate(pct = n / sum(n)) %>%
  arrange(desc(n)) %>%
  head(20) %>%
  ggplot() +
  geom_col(aes(x = reorder(word_stemmed, n), y = pct)) +
  scale_y_continuous(labels = scales::percent) +
  labs(x = 'Top words in review text', y = 'Word prevalence (% of all words in review text)',
       title = 'Top words used in user reviews') +
  coord_flip()
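To make the preprocessing concrete, here is a minimal sketch on two made-up reviews. It shows how tokenizing, stop-word removal, and stemming collapse inflected forms such as "flights" and "flight" into a single count. The `toy_reviews` data frame is illustrative and not taken from the scraped data.

```r
library(dplyr)
library(tidytext)
library(SnowballC)

# Two made-up reviews
toy_reviews <- data.frame(
  id = 1:2,
  review_text = c("The flights were delayed", "My flight was on time"),
  stringsAsFactors = FALSE
)

toy_counts <- toy_reviews %>%
  # One lowercase word per row
  unnest_tokens(output = word, input = review_text) %>%
  # Drop common words such as "the", "were", "my"
  anti_join(stop_words, by = "word") %>%
  # Reduce each word to its stem: "flights" and "flight" both become "flight"
  mutate(word_stemmed = wordStem(word)) %>%
  count(word_stemmed, sort = TRUE)
```

Counting on `word` instead of `word_stemmed` would split the two forms into separate rows, which is why the charts use the stemmed column.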
I also break down the top words by verification status to compare
whether reviews from verified and non-verified trips differ.
# Top words by verification status
tokenized_df_stemmed %>%
  group_by(is_verified_trip, word_stemmed) %>%
  summarise(
    n = n(),
    .groups = 'drop'
  ) %>%
  group_by(is_verified_trip) %>%
  mutate(
    pct = n / sum(n),
    rank = dense_rank(desc(n))
  ) %>%
  filter(rank <= 20) %>%
  ggplot() +
  geom_col(aes(x = reorder(word_stemmed, n), y = pct, fill = is_verified_trip)) +
  labs(x = 'Top words in review text', y = 'Word prevalence (% of all words in review text)',
       title = 'Top words used in user reviews by trip verification status') +
  scale_y_continuous(labels = scales::percent) +
  coord_flip() +
  facet_wrap(~ is_verified_trip, scales = 'free')
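One caveat with the faceted chart: `reorder(word_stemmed, n)` computes a single ordering across both panels, so the bars within a panel may not appear sorted. tidytext provides `reorder_within()` and `scale_x_reordered()` for this case. A minimal sketch on made-up counts follows (the `toy` data frame is illustrative, not taken from the reviews):

```r
library(dplyr)
library(ggplot2)
library(tidytext)

# Made-up word counts for the two verification groups
toy <- data.frame(
  is_verified_trip = rep(c(TRUE, FALSE), each = 3),
  word_stemmed     = rep(c("flight", "seat", "servic"), times = 2),
  n                = c(50, 20, 30, 10, 25, 15),
  stringsAsFactors = FALSE
)

toy_ordered <- toy %>%
  group_by(is_verified_trip) %>%
  mutate(pct = n / sum(n)) %>%
  ungroup() %>%
  # Order words by n separately inside each facet
  mutate(word_ordered = reorder_within(word_stemmed, n, is_verified_trip))

p <- ggplot(toy_ordered, aes(x = word_ordered, y = pct, fill = is_verified_trip)) +
  geom_col() +
  scale_x_reordered() +  # removes the facet suffix from the axis labels
  scale_y_continuous(labels = scales::percent) +
  coord_flip() +
  facet_wrap(~ is_verified_trip, scales = "free")
```

The same two calls could be dropped into the chart above in place of `reorder(word_stemmed, n)`; `scales = "free"` is still needed so that each panel shows only its own words.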
## Conduct sentiment analysis from customer reviews
I use sentiment analysis to extract sentiment from the customer reviews and understand how customers feel overall.
tokenized_df_stemmed %>%
  inner_join(get_sentiments('bing')) %>%
  count(sentiment) %>%
  mutate(pct = n / sum(n)) %>%
  ggplot(aes(x = sentiment, y = pct)) +
  geom_col(aes(fill = sentiment)) +
  scale_y_continuous(labels = scales::percent) +
  geom_label(aes(label = paste0(round(pct * 100), '%'), y = pct + 0.02)) +
  labs(
    x = 'Sentiment',
    y = 'Prevalence by word used',
    title = 'Overall word sentiment used in reviews'
  )
## Joining, by = "word"
tokenized_df_stemmed %>%
  inner_join(get_sentiments('bing')) %>%
  group_by(is_verified_trip, sentiment) %>%
  summarise(n = n()) %>%
  group_by(is_verified_trip) %>%
  mutate(
    pct = n / sum(n),
    is_verified_trip = ifelse(is_verified_trip, 'Verified Trip', 'Not Verified Trip')
  ) %>%
  ggplot(aes(x = sentiment, y = pct)) +
  geom_col(aes(fill = sentiment)) +
  scale_y_continuous(labels = scales::percent) +
  geom_label(aes(label = paste0(round(pct * 100), '%'), y = pct + 0.02)) +
  labs(
    x = 'Sentiment',
    y = 'Prevalence by word used',
    title = 'Word sentiment used in reviews by trip verification status'
  ) +
  facet_wrap(~ is_verified_trip)
## Joining, by = "word"
## `summarise()` has grouped output by 'is_verified_trip'. You can override using
## the `.groups` argument.
To understand the context, I check the top words associated with positive
and negative sentiment.
# Top words used in positive vs negative sentiment
tokenized_df_stemmed %>%
  inner_join(get_sentiments('bing')) %>%
  group_by(sentiment, word) %>%
  summarise(
    n = n()
  ) %>%
  group_by(sentiment) %>%
  mutate(
    pct = n / sum(n),
    rank = dense_rank(desc(pct))
  ) %>%
  filter(rank <= 20) %>%
  ggplot() +
  geom_col(aes(x = reorder(word, n), y = pct, fill = sentiment)) +
  scale_y_continuous(labels = scales::percent) +
  labs(y = '% word used in reviews', x = 'Words', title = 'Top words used in reviews by sentiment') +
  coord_flip() +
  facet_wrap(~ sentiment, scales = 'free')
## Joining, by = "word"
## `summarise()` has grouped output by 'sentiment'. You can override using the
## `.groups` argument.
Lastly, I use the AFINN lexicon to assign numeric sentiment scores to the
words in customer reviews and plot the distribution of those scores. The
result is quite polarized, with separate positive and negative peaks.
# Average sentiment score per scored word, by verification status
tokenized_df_stemmed %>%
  inner_join(get_sentiments('afinn')) %>%
  group_by(is_verified_trip) %>%
  summarise(mean_score = mean(value))
## Joining, by = "word"
## # A tibble: 2 × 2
##   is_verified_trip mean_score
##   <lgl>                 <dbl>
## 1 FALSE              -0.00587
## 2 TRUE               -0.0183
# Distribution of word-level sentiment scores
tokenized_df_stemmed %>%
  inner_join(get_sentiments('afinn')) %>%
  ggplot(aes(x = value)) +
  geom_density(aes(fill = is_verified_trip), alpha = 0.3)
## Joining, by = "word"
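The density above is drawn over individual word scores. AFINN values are integers from -5 to 5, so a word-level distribution will tend to show separate peaks. A complementary view is to aggregate to one score per review. The sketch below does this on a made-up lexicon and made-up tokens (`toy_lexicon` and `toy_tokens` are illustrative stand-ins for `get_sentiments('afinn')` and `tokenized_df_stemmed`, not real data):

```r
library(dplyr)

# Made-up stand-in for the AFINN lexicon (word, integer score)
toy_lexicon <- data.frame(
  word  = c("good", "excellent", "bad", "terrible"),
  value = c(3, 3, -3, -3),
  stringsAsFactors = FALSE
)

# Made-up stand-in for the tokenized reviews: one row per word
toy_tokens <- data.frame(
  id               = c(1, 1, 1, 2, 2, 3),
  is_verified_trip = c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE),
  word             = c("good", "excellent", "seat", "terrible", "good", "bad"),
  stringsAsFactors = FALSE
)

review_scores <- toy_tokens %>%
  # Keep only words that have a score in the lexicon
  inner_join(toy_lexicon, by = "word") %>%
  # Collapse to one row per review
  group_by(id, is_verified_trip) %>%
  summarise(
    mean_score = mean(value),  # average score of the scored words
    n_scored   = n(),          # how many words contributed
    .groups    = "drop"
  )
```

With the real data, `mean_score` could be plotted with `geom_density()` in the same way as the word-level values, and `n_scored` helps flag reviews whose score rests on very few words.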
The final submission can be viewed in the deck here