library(tidyverse)
library(rvest)

This is the R Markdown notebook for Task 1 of the British Airways data analysis submission. Task 1 focuses on scraping customer feedback and review data from a third-party source and analyzing it to present the insights I uncover.

I will use sentiment analysis techniques to analyze the customer reviews and build a one-slide deck as the final deliverable.

Preparing to scrape web data

Setting up the Skytrax customer review web-scraping parameters. I will scrape the 3,000 most recent reviews from the website.

base_url <- "https://www.airlinequality.com/airline-reviews/british-airways/page/"
pages <- 30
page_size <- 100
# pre-allocate one slot per expected review
review_content <- rep(NA_character_, pages * page_size)
review_title <- rep(NA_character_, pages * page_size)

Scraping web data

Using the rvest package to scrape the 3,000 most recent reviews of British Airways. I will also check whether any scrapes contain missing review text, to ensure data hygiene.
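
The scraping chunk itself is not echoed, so below is a minimal sketch of the loop that produced the log that follows. The query string and the CSS selectors `.text_header` and `.text_content` are assumptions about the site's current markup and may need updating if the page structure changes.

for (p in seq_len(pages)) {
  # query-string format is an assumption based on the site's pagination links
  page_url <- paste0(base_url, p, "/?sortby=post_date%3ADesc&pagesize=", page_size)
  page_html <- read_html(page_url)
  # slots in the pre-allocated vectors for this page
  # (assumes every page returns exactly page_size reviews, as the log below confirms)
  idx <- ((p - 1) * page_size + 1):(p * page_size)
  review_title[idx] <- page_html %>% html_elements(".text_header") %>% html_text2()
  review_content[idx] <- page_html %>% html_elements(".text_content") %>% html_text2()
  print(paste("Finish reading page", p, "currently with", p * page_size, "reviews"))
  Sys.sleep(1)  # pause between requests to be polite to the server
}
# data-hygiene check, run for review text and for titles
print(paste("Missing review text content is", 100 * mean(is.na(review_content)), "% of data"))
print(paste("Missing review text content is", 100 * mean(is.na(review_title)), "% of data"))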

## [1] "Finish reading page 1 currently with 100 reviews"
## [1] "Finish reading page 2 currently with 200 reviews"
## [1] "Finish reading page 3 currently with 300 reviews"
## [1] "Finish reading page 4 currently with 400 reviews"
## [1] "Finish reading page 5 currently with 500 reviews"
## [1] "Finish reading page 6 currently with 600 reviews"
## [1] "Finish reading page 7 currently with 700 reviews"
## [1] "Finish reading page 8 currently with 800 reviews"
## [1] "Finish reading page 9 currently with 900 reviews"
## [1] "Finish reading page 10 currently with 1000 reviews"
## [1] "Finish reading page 11 currently with 1100 reviews"
## [1] "Finish reading page 12 currently with 1200 reviews"
## [1] "Finish reading page 13 currently with 1300 reviews"
## [1] "Finish reading page 14 currently with 1400 reviews"
## [1] "Finish reading page 15 currently with 1500 reviews"
## [1] "Finish reading page 16 currently with 1600 reviews"
## [1] "Finish reading page 17 currently with 1700 reviews"
## [1] "Finish reading page 18 currently with 1800 reviews"
## [1] "Finish reading page 19 currently with 1900 reviews"
## [1] "Finish reading page 20 currently with 2000 reviews"
## [1] "Finish reading page 21 currently with 2100 reviews"
## [1] "Finish reading page 22 currently with 2200 reviews"
## [1] "Finish reading page 23 currently with 2300 reviews"
## [1] "Finish reading page 24 currently with 2400 reviews"
## [1] "Finish reading page 25 currently with 2500 reviews"
## [1] "Finish reading page 26 currently with 2600 reviews"
## [1] "Finish reading page 27 currently with 2700 reviews"
## [1] "Finish reading page 28 currently with 2800 reviews"
## [1] "Finish reading page 29 currently with 2900 reviews"
## [1] "Finish reading page 30 currently with 3000 reviews"
## [1] "Missing review text content is 0 % of data"
## [1] "Missing review text content is 0 % of data"

Cleaning the scraped data

Once the data is scraped, I clean it and store it in a tidy format: one review per row, with the text cleaned to remove unnecessary elements such as emoji, and the “Trip Verified” prefix parsed into a separate column for later analysis.

df <- tibble(
  id = seq_along(review_content), 
  review_title = review_title, 
  review_content = review_content
  ) %>%
  mutate(
    # flag reviews carrying the "Trip Verified" prefix
    is_verified_trip = str_detect(review_content, 'Trip Verified'),
    # strip the leading emoji and the "Trip Verified | " / "Not Verified | " prefix
    review_text = str_remove(string = review_content, 
                             pattern = regex('([:emoji:] )?(trip|not) verified \\| ', 
                                             ignore_case = TRUE))
  )

glimpse(df)
## Rows: 3,000
## Columns: 5
## $ id               <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ review_title     <chr> "Lots of cancellations and delays", "Overall, very ha…
## $ review_content   <chr> "✅ Trip Verified | Lots of cancellations and delays …
## $ is_verified_trip <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ review_text      <chr> "Lots of cancellations and delays and no one apologiz…

Conduct basic EDA

Now that the data is in a tidy format, I conduct basic EDA to look at the share of reviews from verified vs. non-verified trips.

library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
# share of reviews that come from verified trips
df %>%
  group_by(is_verified_trip) %>% 
  summarise(n = n()) %>% 
  mutate(pct = n/sum(n)) %>%
  ggplot(aes(x = is_verified_trip, y = pct)) + 
  geom_col(aes(fill = is_verified_trip)) + 
  scale_y_continuous(labels = scales::percent) + 
  geom_label(aes(label = paste0(round(pct * 100), '%'), x = is_verified_trip, y = pct + 0.05)) + 
  labs(x = 'Is Trip Verified', y = '% of records', title = '% of reviews from verified trips')

I also looked at the top words used in reviews to get a high-level view of what customers usually talk about when they review British Airways.

library(tidytext)
library(textdata)
library(SnowballC)

tokenized_df <- df %>% 
  unnest_tokens(input = 'review_text', output = 'word') %>%
  anti_join(stop_words) %>%  # drop common English stop words
  filter(!str_detect(word, "([:digit:]|[:punct:])"))  # drop tokens containing digits or punctuation
## Joining, by = "word"
# stem words so inflected forms (e.g. "flights" -> "flight") count together
tokenized_df_stemmed <- tokenized_df %>%
  mutate(word_stemmed = wordStem(word))
# top 20 stemmed words overall
tokenized_df_stemmed %>%
  count(word_stemmed) %>%
  mutate(pct = n/sum(n)) %>%
  arrange(desc(n)) %>%
  head(20) %>%
  ggplot() +
  geom_col(aes(x = reorder(word_stemmed, n), y = pct)) + 
  scale_y_continuous(labels = scales::percent) + 
  labs(x = 'Top words in review text', y = 'Word prevalence (% of all words in review text)',
       title = 'Top words used in user reviews') + 
  coord_flip() 

I also break down the top words by trip verification status to see whether reviews from verified and non-verified trips differ.

# top words by verification status
tokenized_df_stemmed %>%
  group_by(is_verified_trip, word_stemmed) %>%
  summarise(
    n = n(), 
    .groups = 'drop'
  ) %>%
  group_by(is_verified_trip) %>%
  mutate(
    pct = n/sum(n),
    rank = dense_rank(desc(n))
  ) %>%
  filter(rank <= 20) %>%
  ggplot() +
  geom_col(aes(x = reorder(word_stemmed, n), y = pct, fill = is_verified_trip)) + 
  labs(x = 'Top words in review text', y = 'Word prevalence (% of all words in review text)',
       title = 'Top words used in user reviews by trip verification status') + 
  scale_y_continuous(labels = scales::percent) + 
  coord_flip() +
  facet_wrap(~ is_verified_trip, scales = 'free')
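
The faceted bars make it hard to compare a given word's prevalence across the two groups directly. As a complementary view (a sketch beyond the original analysis), differencing the prevalence between groups puts each word's gap in a single column:

# Sketch (not part of the original analysis): difference in word prevalence
# between verified and non-verified reviews, largest gaps first.
tokenized_df_stemmed %>%
  count(is_verified_trip, word_stemmed) %>%
  group_by(is_verified_trip) %>%
  mutate(pct = n / sum(n)) %>%
  ungroup() %>%
  select(-n) %>%
  pivot_wider(names_from = is_verified_trip, values_from = pct,
              names_prefix = 'verified_', values_fill = 0) %>%
  mutate(diff = verified_TRUE - verified_FALSE) %>%
  arrange(desc(abs(diff)))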

Conduct sentiment analysis on customer reviews

I use sentiment analysis to extract the sentiment expressed in customer reviews, starting with the Bing lexicon to classify each word as positive or negative.

tokenized_df_stemmed %>%
  inner_join(get_sentiments('bing')) %>%
  count(sentiment) %>%
  mutate(pct = n/sum(n)) %>%
  ggplot(aes(x = sentiment, y = pct)) + 
  geom_col(aes(fill = sentiment)) + 
  scale_y_continuous(labels = scales::percent) + 
  geom_label(aes(label = paste0(round(pct*100), '%'), x = sentiment, y = pct + 0.02)) + 
  labs(
    x = 'Sentiment', 
    y = 'Prevalence by word used', 
    title = 'Overall word sentiment used in reviews'
  ) 
## Joining, by = "word"

tokenized_df_stemmed %>%
  inner_join(get_sentiments('bing')) %>%
  group_by(is_verified_trip, sentiment) %>%
  summarise(n = n()) %>%
  group_by(is_verified_trip) %>%
  mutate(
    pct = n/sum(n),
    is_verified_trip = ifelse(is_verified_trip, 'Verified Trip', 'Not Verified Trip')
    ) %>%
  ggplot(aes(x = sentiment, y = pct)) + 
  geom_col(aes(fill = sentiment)) + 
  scale_y_continuous(labels = scales::percent) + 
  geom_label(aes(label = paste0(round(pct*100), '%'), x = sentiment, y = pct + 0.02)) + 
  labs(
    x = 'Sentiment', 
    y = 'Prevalence by word used', 
    title = 'Word sentiment used in reviews by trip verification status'
  ) + 
  facet_wrap(~ is_verified_trip)
## Joining, by = "word"
## `summarise()` has grouped output by 'is_verified_trip'. You can override using
## the `.groups` argument.

Checking the top words associated with positive vs. negative sentiment to understand the context.

#top words used in positive vs negative sentiments

tokenized_df_stemmed %>%
  inner_join(get_sentiments('bing')) %>%
  group_by(sentiment, word) %>%
  summarise(
    n = n()
  ) %>%
  group_by(sentiment) %>%
  mutate(
    pct = n/sum(n), 
    rank = dense_rank(desc(pct))
  ) %>%
  filter(rank <= 20) %>%
  ggplot() + 
  geom_col(aes(x = reorder(word, n), y = pct, fill = sentiment)) + 
  scale_y_continuous(labels = scales::percent) + 
  labs(y = '% of sentiment words used in reviews', x = 'Words', title = 'Top words used in reviews by sentiment') +
  coord_flip() + 
  facet_wrap( ~ sentiment, scales = 'free') 
## Joining, by = "word"
## `summarise()` has grouped output by 'sentiment'. You can override using the
## `.groups` argument.
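
One caveat worth flagging: word-level lexicons ignore negation, so a phrase like “not good” contributes a positive token to the counts above. A quick sketch (beyond the original analysis) to gauge how often sentiment words follow a negation:

# Sketch: count bigrams where a sentiment word directly follows a negation term.
negations <- c('not', 'no', 'never', 'without')

df %>%
  unnest_tokens(output = 'bigram', input = 'review_text', token = 'ngrams', n = 2) %>%
  separate(bigram, into = c('word1', 'word2'), sep = ' ') %>%
  filter(word1 %in% negations) %>%
  inner_join(get_sentiments('bing'), by = c('word2' = 'word')) %>%
  count(word1, word2, sentiment, sort = TRUE) %>%
  head(10)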

Lastly, I use the AFINN lexicon to quantify sentiment scores in the reviews and plot the distribution of those scores – the group averages sit close to zero, but the distribution itself is strongly polarized.

# average AFINN score per word used, by verification status
tokenized_df_stemmed %>%
  inner_join(get_sentiments('afinn')) %>%
  group_by(is_verified_trip) %>%
  summarise(mean_score = mean(value)) 
## Joining, by = "word"
## # A tibble: 2 × 2
##   is_verified_trip mean_score
##   <lgl>                 <dbl>
## 1 FALSE              -0.00587
## 2 TRUE               -0.0183
# distribution of word-level sentiment scores
tokenized_df_stemmed %>%
  inner_join(get_sentiments('afinn')) %>%
  ggplot(aes(x = value)) + 
  geom_density(aes(fill = is_verified_trip), alpha = 0.3) +
  labs(x = 'AFINN sentiment score', y = 'Density', fill = 'Verified trip')
## Joining, by = "word"
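
Note that the density above is over word-level AFINN values. As a complementary view (a sketch on my part, not in the original analysis), aggregating values to one score per review `id` shows sentiment at the customer level:

# Sketch: sum AFINN values within each review to get a per-review score.
review_scores <- tokenized_df_stemmed %>%
  inner_join(get_sentiments('afinn'), by = 'word') %>%
  group_by(id, is_verified_trip) %>%
  summarise(review_score = sum(value), .groups = 'drop')

ggplot(review_scores, aes(x = review_score)) +
  geom_density(aes(fill = is_verified_trip), alpha = 0.3) +
  labs(x = 'Per-review AFINN score', y = 'Density',
       title = 'Distribution of review-level sentiment scores')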

The final submission can be viewed in the deck here.