Sentiment Analysis of Professional Car Reviews

Author

Ben Shilling

Introduction

Buying a car is one of the biggest purchases most people will make in their lives, so why not be informed when making it? I am looking to purchase a car from a Japanese make, preferably a Honda or Toyota, and I would like to learn what the industry has to say about them. I would like to know what emotions these cars bring out, as well as whether the general attitude toward each car is positive or negative.

To find out, I will use data harvested from caranddriver.com articles on 5 car models from the years 2020 and 2024. Using the function below and an appropriate user agent, you will be able to collect data on your own list of cars and then do your own sentiment analysis.

Data Collection

library(rvest) # to easily access HTML
library(tidyverse) # for all things tidy!
library(httr) # Lets GET some Data!

car_data <- data.frame(
  Make = c("honda", "toyota", "honda", "honda", "toyota", 
           "honda", "toyota", "honda", "honda", "toyota"),
  Model = c("accord", "rav4", "cr-v", "pilot", "camry", 
            "accord", "rav4", "cr-v", "pilot", "camry"),
  Year = c(2020, 2020, 2020, 2020, 2020,
           2024, 2024, 2024, 2024, 2024)
) # The cars we want to learn about

##### Function ##### 

car_and_driver_review <- 
  function(list_of_cars){ # input a list of cars and learn what caranddriver.com 
                          # has to say

    car_data <- list_of_cars 
    
    car_and_driver_car_reviews <- data.frame() # prepare a loop
    
    for (i in seq_along(car_data$Model)) {
      
      # Construct the URL dynamically
      url <- paste0("https://www.caranddriver.com/", car_data$Make[i], "/", car_data$Model[i], "-", car_data$Year[i])
      
      # Read the HTML content from the URL (without indexing the URL here)
      review <- read_html(url)
      
      # Scrape the section headers from each paragraph's data-anchor-id attribute
      review_headers <- 
        review %>% 
        html_elements("p.css-zcoxvl.e802kp11") %>% 
        html_attr("data-anchor-id")
      
      # Scrape the review body
      review_body <- 
        review %>% 
        html_elements("p.css-zcoxvl.e802kp11") %>% 
        html_text2()
      
      make <- car_data$Make[i]
      model <- car_data$Model[i]
      year <- car_data$Year[i]
      
      # Create columns to easily id each review
      
      # Create a temporary data frame and bind it to the main data frame
      temp_df <- data.frame(header = review_headers, text = review_body, make = make, 
                            model = model, year = year)
      
      # Append the temp data frame to the main data frame
      car_and_driver_car_reviews <- bind_rows(car_and_driver_car_reviews, temp_df)
      
      print(paste("Collected Review on a", car_data$Make[i], car_data$Model[i], "from", car_data$Year[i], "(", i, "/", nrow(car_data), ")", sep = " "))
      
      # Random sleep time between 5 and 60 seconds (I was put in timeout a lot)
      sleep_time <- runif(1, min = 5, max = 60)  
      Sys.sleep(sleep_time)
    }
  return(car_and_driver_car_reviews)
}

Using your own list of cars, you will be able to harvest the articles, split by paragraph header, to do your own analysis!
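The introduction mentions an appropriate user agent, but the function above calls read_html() on the URL directly. Here is a minimal sketch of one way to send a user agent with httr, which the script already loads. The helper names build_review_url and fetch_review_page and the user agent string are hypothetical placeholders, not part of the original function.

```r
# Hypothetical helper: build the review URL the same way the loop above does
build_review_url <- function(make, model, year) {
  paste0("https://www.caranddriver.com/", make, "/", model, "-", year)
}

# Hypothetical helper: fetch one page while identifying the client.
# The default user agent string is a placeholder; replace it with your own.
fetch_review_page <- function(url, agent = "my-car-review-project") {
  response <- httr::GET(url, httr::user_agent(agent))
  httr::stop_for_status(response)  # fail loudly on 4xx/5xx responses
  page_text <- httr::content(response, as = "text", encoding = "UTF-8")
  rvest::read_html(page_text)      # parse so html_elements() can be used on it
}

# Building URLs needs no network access
urls <- build_review_url(c("honda", "toyota"), c("accord", "camry"), c(2024, 2020))
print(urls)
```

Inside the loop, the line that reads the page could then call fetch_review_page(url) in place of read_html(url), with the random sleep kept as it is.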

Methods for Analysis

To analyze the text, I will use the tidytext package to tokenize each block of harvested text. Using preassigned sentiments, I will then be able to gather what the car reviews have to say about these models. To complete this, I will use the bing and nrc lexicons available through the tidytext package, the harvested data, ggplot visuals, and knitted tables.
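The tokenize-then-join idea can be sketched as follows. This is a minimal illustration, not the exact code behind the tables: the review text is made up, and toy_lexicon is a hypothetical four-word stand-in with the same columns as get_sentiments("bing") so the example runs without downloading anything.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the scraped data (made-up text, not real review content)
reviews <- tibble(
  make  = c("honda", "toyota"),
  model = c("accord", "camry"),
  text  = c("Smooth ride and a comfortable cabin, but noisy on the highway.",
            "Comfortable seats, but a noisy engine and cramped rear seats.")
)

# Hypothetical mini lexicon with the same columns as get_sentiments("bing");
# swap in get_sentiments("bing") for a real analysis
toy_lexicon <- tibble(
  word      = c("smooth", "comfortable", "noisy", "cramped"),
  sentiment = c("positive", "positive", "negative", "negative")
)

sentiment_counts <- reviews %>%
  unnest_tokens(word, text) %>%             # one lowercased word per row
  inner_join(toy_lexicon, by = "word") %>%  # keep only words that carry a sentiment
  count(make, model, sentiment)             # tally sentiments per car

print(sentiment_counts)
```

The same pattern should work with the nrc lexicon, which adds emotion categories such as fear and trust alongside positive and negative.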

Analysis

Which car was the most positively received?

The data I collected covers 5 models from Toyota and Honda for the years 2020 and 2024. These cars were the Honda Accord, CR-V, and Pilot, and the Toyota Camry and RAV4. I chose these models because the CR-V and RAV4, and the Accord and Camry, are direct competitors, and the Pilot is the Honda that I would like the most. Of these cars, I would like to know which was the most positively received.

Cars by Negativity Ratio
make    model   fear  trust  positive  negative  positivity  negativity_ratio
honda   accord    20     67        92        29          63             20.25
honda   pilot     14     73       101        31          70             29.25
toyota  rav4      13     76       107        44          63             30.17
honda   cr-v      18     63        96        39          57             36.32
toyota  camry     19     62        99        39          60             37.46
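The relationships between these columns can be checked with a few lines of base R. The positivity column equals positive minus negative in every row. The formula for negativity_ratio is not given here, and it cannot be recomputed from the positive and negative columns shown, so the last step below rests on an assumption: that the ratio counts negative words per 100 positive words.

```r
# Values copied from the model-level table above
cars <- data.frame(
  make             = c("honda", "honda", "toyota", "honda", "toyota"),
  model            = c("accord", "pilot", "rav4", "cr-v", "camry"),
  positive         = c(92, 101, 107, 96, 99),
  negative         = c(29, 31, 44, 39, 39),
  positivity       = c(63, 70, 63, 57, 60),
  negativity_ratio = c(20.25, 29.25, 30.17, 36.32, 37.46)
)

# Check that positivity is positive minus negative for every row
positivity_matches <- all(cars$positive - cars$negative == cars$positivity)

# ASSUMPTION: negativity_ratio is negative words per 100 positive words,
# so 100 / ratio gives positive words per negative word
cars$positive_per_negative <- round(100 / cars$negativity_ratio, 1)

print(positivity_matches)
print(cars[, c("make", "model", "positive_per_negative")])
```

Under that assumption, the Accord comes out at about 4.9 positive words per negative word and the Camry at about 2.7.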

It all makes sense now why my dad wanted me to get an Accord. From this we can tell that the Honda Accord receives, on average, about 5 positive words for each negative word, while its direct competitor, the Toyota Camry, has the worst positive-to-negative ratio. Does this change based on the year?

Cars by Negativity Ratio
year  make    model   fear  trust  positive  negative  positivity  negativity_ratio
2024  honda   accord    11     28        43        12          31             15.91
2020  honda   accord     9     39        49        17          32             25.56
2020  honda   pilot      3     40        53        14          39             27.02
2020  toyota  rav4       8     38        58        22          36             29.84
2024  toyota  rav4       5     38        49        22          27             30.50
2024  toyota  camry     10     32        42        18          24             30.52
2024  honda   pilot     11     33        48        17          31             31.27
2024  honda   cr-v       6     24        36        18          18             33.41
2020  honda   cr-v      12     39        60        21          39             39.88
2020  toyota  camry      9     30        57        21          36             43.76

It turns out the Honda Accord really is a beloved car, while the 2020 Camry has the highest negativity ratio of any car and year tested, which suggests it was not as well liked. The CR-V also seems to be less well received than its RAV4 counterpart, with a higher negativity ratio in both years. Based on this, we can see that the Honda Accord is convincingly the most positively received of the cars tested.

What are the most common words describing Hondas and Toyotas?

From the earlier analysis we can see that the best-received car was a Honda and the reviewers’ least favorite car was a Toyota. But comparing across brands rather than at the model level, what are the most common words used to describe each brand?

As we saw in the sentiments from the NRC lexicon, the reviewers did not have much negative to say about either brand. One common negative word that came up was “hard”, but this word can refer to the seats or the brakes. One thing to take note of is that even though more articles were collected on Honda, it still has proportionally higher counts for the words used compared to Toyota. There isn’t much we can learn from this, but we can guess that many of the positive words for Honda come from descriptions of the 2024 Accord. What we can infer, though, is that since the word “top” is used with Honda and Toyota, Honda is more than likely a leader in its division of cars.
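A brand-level word count like the one discussed above can be sketched as follows. This is an illustration under stated assumptions rather than the code behind the original plots: the review sentences are made up, and toy_stop_words is a hypothetical short list standing in for the stop_words dataset that ships with tidytext.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the scraped data (made-up sentences)
reviews <- tibble(
  make = c("honda", "honda", "toyota"),
  text = c("The seats are comfortable and the ride is smooth.",
           "A comfortable cabin with hard plastics.",
           "The brakes feel hard and the cabin is noisy.")
)

# Hypothetical short stop-word list; tidytext's stop_words is the usual choice
toy_stop_words <- tibble(word = c("the", "are", "and", "is", "a", "with"))

top_words <- reviews %>%
  unnest_tokens(word, text) %>%                # one lowercased word per row
  anti_join(toy_stop_words, by = "word") %>%   # drop filler words
  count(make, word, sort = TRUE) %>%           # word frequency per brand
  group_by(make) %>%
  slice_max(n, n = 3, with_ties = FALSE) %>%   # top 3 words per brand
  ungroup()

print(top_words)
```

In this toy data, “hard” shows up for both brands, once for plastics and once for brakes, which is the same ambiguity noted above: a raw word count cannot tell which part of the car a word is describing.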

What are the most common words describing the 2020 and 2024 models?

The words used to describe each year’s models are not very different. This could simply be because not much has changed for automakers in the past few years. Because of this, we cannot say there is much change from year to year.

Conclusion

Based on the analysis of the caranddriver.com articles, I am going to buy another Honda Accord once that adult money hits!