Assignment 7 Acura vs Lexus

Author

Emma Black

Introduction

I am analyzing the most popular luxury SUVs on the market (according to the car rating website Edmunds.com) with the intention of finding the best one to purchase after graduation. I have found that two of the most reliable and highly rated SUV brands in my price range are Lexus and Acura, so I will be comparing reviews of two of their most popular entry-level SUV models: the Acura RDX and the Lexus NX.

Note: I will refer to a positivity value throughout the analysis. This is a calculated field based on the overall positivity or negativity of the words used to write a review. A negative positivity value indicates a negative review.
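To make the positivity value concrete, here is a minimal sketch of the idea in base R: count the positive words in a review and subtract the count of negative words. The five-word lexicon and the `score_positivity()` helper are hypothetical and exist only for illustration; the report itself scores reviews against the much larger NRC lexicon.

```r
# Hypothetical mini-lexicon (illustration only; the report uses the NRC lexicon)
lexicon <- data.frame(
  word      = c("love", "comfortable", "reliable", "noisy", "disappointed"),
  sentiment = c("positive", "positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: positive word count minus negative word count
score_positivity <- function(review, lexicon) {
  # Lowercase, then split on anything that is not a letter or apostrophe
  words <- unlist(strsplit(tolower(review), "[^a-z']+"))
  words <- words[nzchar(words)]
  # Look up each word; words missing from the lexicon give NA and are ignored
  tags <- lexicon$sentiment[match(words, lexicon$word)]
  sum(tags == "positive", na.rm = TRUE) - sum(tags == "negative", na.rm = TRUE)
}

score_positivity("I love how comfortable and reliable it is", lexicon)  # 3
score_positivity("Noisy cabin, I was disappointed", lexicon)            # -2
```

Because the score is a raw difference of counts, a long review can reach a much larger absolute value than a short one with the same tone.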

Question 1: Is there a correlation between the month and the number of reviews published?

My hypothesis is that there would be more reviews around December and January, as these are generally the most popular times to purchase a car and it seems likely that people would review the car while it is relatively new to them.

Analysis

December is one of the most popular months to purchase a car: new model years are typically released then, and the month brings Christmas and promotions such as the Lexus December to Remember Sales Event. It is reasonable to infer that people are more likely to review cars they have recently bought, which would cause a spike in car reviews around December, January, and February. While this pattern generally holds true, Lexus has its most significant spike in reviews in December, while Acura’s largest spike happens in May. This suggests either that variables other than how recently the car was bought affect when a consumer chooses to publish a review, or that Acura has particularly effective sales events in “off” months that Lexus does not have.
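One way to check the seasonal hypothesis more directly is to pool reviews by calendar month, regardless of year, and look at the share that falls in the December to February window. The sketch below uses base R and a handful of made-up dates; `review_dates` is a hypothetical stand-in for the date column of the review data.

```r
# Hypothetical review dates (illustration only)
review_dates <- as.Date(c("2022-12-03", "2022-12-19", "2023-01-07",
                          "2023-05-11", "2023-05-28", "2023-12-24"))

# Calendar month (1-12), ignoring the year
month_num <- as.integer(format(review_dates, "%m"))

# Count reviews per calendar month, keeping months with zero reviews
month_counts <- table(factor(month_num, levels = 1:12, labels = month.abb))
month_counts

# Share of all reviews published in December, January, or February
winter_share <- sum(month_counts[c("Dec", "Jan", "Feb")]) / sum(month_counts)
winter_share
```

If reviews were spread evenly across the year, the winter share would be close to 3/12, so a value well above 0.25 would be consistent with the hypothesis.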

Question 2: Is one car reviewed more positively than the other?

I examined the distribution of the overall positivity scores of each car to understand where the majority of reviews fell and how outliers impacted the scores.

# A tibble: 2 × 2
  car_model median_positivity
  <chr>                 <dbl>
1 acura                   5  
2 lexus                   5.5

Analysis

Both cars have very similar median positivity values: 5 for Acura and 5.5 for Lexus. Outliers can likely be attributed to unusually long reviews, which contain more scoreable words. The similarity in positivity ratings is unsurprising, as both cars have very similar overall ratings on the Edmunds website: 4.1/5 stars for the Acura and 4.2/5 stars for the Lexus.
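If long reviews are what drives the outliers, one possible adjustment is to divide the net score by the number of scored words, which gives a rate between -1 and 1 that does not grow with review length. The sketch below uses base R with made-up counts; the `positivity_rate` column is an assumption of this example and is not part of the original analysis.

```r
# Hypothetical per-review counts of positive and negative words
reviews <- data.frame(
  car_model = c("acura", "acura", "lexus", "lexus"),
  positive  = c(6, 40, 8, 5),
  negative  = c(1, 10, 2, 1)
)

# Raw positivity grows with review length
reviews$positivity <- reviews$positive - reviews$negative

# Net positivity per scored word; a review with no positive or negative
# words would give 0/0 (NaN) and would need to be filtered out first
reviews$positivity_rate <-
  reviews$positivity / (reviews$positive + reviews$negative)

# Median of both measures for each car model
aggregate(cbind(positivity, positivity_rate) ~ car_model,
          data = reviews, FUN = median)
```

In this toy data the second Acura review has a raw score of 30 but a rate of 0.6, the same rate as a Lexus review whose raw score is only 6.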

Question 3: Is the positivity value an accurate reflection of the reviewer’s feelings about the car?

To answer this question, I compared the star value that the reviewer assigned the car with the calculated positivity value to understand if a higher star value correlates to a higher positivity score.

Analysis

In general, the higher the star rating a reviewer gave, the more positive the language they used in their review. This suggests that the results of the text analysis are generally reliable. However, the positivity ranges for adjacent star ratings overlap, so a formal test such as ANOVA is needed to understand the statistical relationship between these two variables.
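The follow-up test mentioned here could look something like the sketch below, which uses only functions from base R's `stats` package. The data are simulated purely for illustration (the `sim` data frame is hypothetical), so the printed results say nothing about the actual reviews.

```r
set.seed(42)

# Simulated data, for illustration only: 20 reviews per star rating,
# with positivity loosely increasing with stars
stars <- rep(1:5, each = 20)
positivity <- round(rnorm(100, mean = stars, sd = 3))
sim <- data.frame(stars, positivity)

# One-way ANOVA, treating star rating as a categorical factor
fit <- aov(positivity ~ factor(stars), data = sim)
summary(fit)

# Spearman rank correlation, treating star rating as an ordinal variable;
# exact = FALSE because tied values rule out an exact p-value
cor.test(sim$stars, sim$positivity, method = "spearman", exact = FALSE)
```

ANOVA assumes roughly normal residuals with similar variance in each group. Because positivity scores have long-review outliers, a rank-based alternative such as `kruskal.test(positivity ~ factor(stars), data = sim)` may be the safer choice.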

Question 4: What are the top words used to describe each car? Do they differ between the two cars?

Analysis

Note: the word “5” for the Acura was used 391 times (not 3), but the label was cut off in the chart.

The top 10 words used to describe both cars were extremely similar, with “5”, “stars”, and “car” being the top 3 words used for both. The words common to both lists highlight the general qualities of a car that reviewers value, such as technology, comfort, and reliability.
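Since tokens such as "5", "stars", and "car" describe the rating or the product category rather than the vehicle itself, a possible refinement is to drop purely numeric tokens and a short custom stop-word list before counting. The sketch below uses base R with made-up tokens; `custom_stop` is an assumed list and would need tuning against the real reviews.

```r
# Hypothetical tokens, one row per word occurrence (illustration only)
tokens <- data.frame(
  car_model = c("acura", "acura", "acura", "acura",
                "lexus", "lexus", "lexus", "lexus"),
  word      = c("5", "stars", "comfortable", "comfortable",
                "car", "5", "reliable", "quiet"),
  stringsAsFactors = FALSE
)

# Assumed list of words that describe the review rather than the vehicle
custom_stop <- c("car", "stars", "star", "suv", "acura", "lexus", "rdx", "nx")

# Keep tokens that are neither purely numeric nor in the custom list
keep <- !grepl("^[0-9]+$", tokens$word) & !(tokens$word %in% custom_stop)
descriptive <- tokens[keep, ]

# Word frequencies per car model, most frequent first
word_counts <- as.data.frame(table(descriptive$car_model, descriptive$word),
                             stringsAsFactors = FALSE)
names(word_counts) <- c("car_model", "word", "n")
word_counts <- word_counts[word_counts$n > 0, ]
word_counts[order(word_counts$car_model, -word_counts$n), ]
```

With the rating tokens removed, the top-10 lists are more likely to surface words that distinguish the two cars.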

---
title: "Assignment 7 Acura vs Lexus"
author: "Emma Black"
editor: visual
toc: true                  # Generates an automatic table of contents.
format:                    # Options related to formatting.
  html:                    # Options related to HTML output.
    code-tools: TRUE       # Allow the code tools option showing in the output.
    embed-resources: TRUE  # Embeds all components into a single HTML file.
execute:                   # Options related to the execution of code chunks.
  warning: FALSE           # FALSE: Code chunk warnings are hidden by default.
  message: FALSE           # FALSE: Code chunk messages are hidden by default.
  echo: FALSE              # FALSE: Code is hidden in the output.
---

## Introduction

I am analyzing the most popular luxury SUVs on the market (according to the car rating website Edmunds.com) with the intention of finding the best one to purchase after graduation. I have found that two of the most reliable and highly rated SUV brands in my price range are Lexus and Acura, so I will be comparing reviews of two of their most popular entry-level SUV models: the Acura RDX and the Lexus NX.

Note: I will refer to a positivity value throughout the analysis. This is a calculated field based on the overall positivity or negativity of the words used to write a review. A negative positivity value indicates a negative review.

```{r}
#| label: load libraries
#| include: FALSE
#| message: false

library(tidyverse)  # For all the tidy things!
library(lubridate)  # Convenient transforming of date values.
library(knitr)      # Useful tools when 'knitting' (rendering) Quarto documents.
library(skimr)      # Tool for quickly assessing a data set
library(jsonlite)   # Converting json data into data frames
library(magrittr)   # Extracting items from list objects using piping grammar
library(httr)       # Interacting with HTTP verbs
library(tidytext)   # Tidy text mining
library(textdata)   # Lexicons of sentiment data
library(widyr)      # Easily calculating pairwise counts
```

```{r}
#| label: load csv

lex_acura_reviews <- read.csv("https://myxavier-my.sharepoint.com/:x:/g/personal/blacke6_xavier_edu/Ef-R7Rp2nRVCvd1Ae8cXPgwB7zD6WlmtXA5w1QnWXFLdXw?download=1")
```

```{r}
#| label: prepare the data

# Tokenize each review by word and drop common stop words
tidy_reviews <- lex_acura_reviews %>%
  unnest_tokens(word, review) %>%
  anti_join(stop_words, by = "word")

# Join the review data to the NRC lexicon
nrc <- get_sentiments("nrc")

# Count words per sentiment for each reviewer, then compute positivity
suv_sentiments <- tidy_reviews %>%
  inner_join(nrc, by = "word", relationship = "many-to-many") %>%
  group_by(sentiment, user_name) %>%
  summarize(n = n(), .groups = "drop") %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(positivity = positive - negative)

# Attach the sentiment counts to the original reviews (matched on user_name)
reviews_w_sentiments <- lex_acura_reviews %>%
  inner_join(suv_sentiments, by = "user_name") %>%
  mutate(date = ymd(date))
```

## Question 1: Is there a correlation between the month and the number of reviews published?

My hypothesis is that there would be more reviews around December and January, as these are generally the most popular times to purchase a car and it seems likely that people would review the car while it is relatively new to them.
```{r}
#| label: question 1

# Add a month column by rounding each date down to the first of its month
reviews_w_sentiments <- reviews_w_sentiments %>%
  mutate(month = floor_date(date, unit = "month"))

# Group by month and car_model, then count the number of reviews
monthly_reviews_count <- reviews_w_sentiments %>%
  group_by(month, car_model) %>%
  summarize(review_count = n(), .groups = "drop")

# Create the bar graph
ggplot(monthly_reviews_count, aes(x = month, y = review_count, fill = car_model)) +
  geom_bar(stat = "identity", position = "dodge") +  # Use dodge to separate bars by car_model
  labs(
    title = "Number of Reviews by Month and Car Model",
    x = "Month",
    y = "Number of Reviews",
    fill = "Car Model"
  ) +
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +  # Format x-axis for months
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for readability
```

### Analysis

December is one of the most popular months to purchase a car: new model years are typically released then, and the month brings Christmas and promotions such as the Lexus December to Remember Sales Event. It is reasonable to infer that people are more likely to review cars they have recently bought, which would cause a spike in car reviews around December, January, and February. While this pattern generally holds true, Lexus has its most significant spike in reviews in December, while Acura's largest spike happens in May. This suggests either that variables other than how recently the car was bought affect when a consumer chooses to publish a review, or that Acura has particularly effective sales events in "off" months that Lexus does not have.

## Question 2: Is one car reviewed more positively than the other?

I examined the distribution of the overall positivity scores of each car to understand where the majority of reviews fell and how outliers impacted the scores.
```{r}
ggplot(reviews_w_sentiments, aes(x = positivity, fill = car_model)) +
  geom_histogram(binwidth = 1, position = "dodge", alpha = 0.7, color = "black") +
  facet_wrap(~ car_model) +  # Facet by car_model to create separate plots
  geom_vline(
    data = reviews_w_sentiments %>%
      group_by(car_model) %>%
      summarize(median_positivity = median(positivity, na.rm = TRUE)),
    aes(xintercept = median_positivity, color = car_model),
    linetype = "dashed",
    size = 1
  ) +  # Add dashed lines for median
  labs(
    title = "Distribution of Positivity Scores for Lexus and Acura",
    x = "Positivity Score",
    y = "Count",
    fill = "Car Model",
    color = "Car Model"
  ) +
  theme_minimal() +
  theme(
    legend.position = "top",  # Position the legend on top
    axis.text.x = element_text(angle = 45, hjust = 1)  # Rotate x-axis labels for better readability
  )
```

```{r}
median_positivity_by_model <- reviews_w_sentiments %>%
  group_by(car_model) %>%
  summarize(median_positivity = median(positivity, na.rm = TRUE))

# Print the result
print(median_positivity_by_model)
```

### Analysis

Both cars have very similar median positivity values: 5 for Acura and 5.5 for Lexus. Outliers can likely be attributed to unusually long reviews, which contain more scoreable words. The similarity in positivity ratings is unsurprising, as both cars have very similar overall ratings on the Edmunds website: 4.1/5 stars for the Acura and 4.2/5 stars for the Lexus.

## Question 3: Is the positivity value an accurate reflection of the reviewer's feelings about the car?

To answer this question, I compared the star value that the reviewer assigned the car with the calculated positivity value to understand if a higher star value correlates to a higher positivity score.
```{r}
ggplot(reviews_w_sentiments, aes(x = as.factor(stars), y = positivity, fill = car_model)) +
  geom_boxplot(alpha = 0.7, outlier.shape = 16, outlier.colour = "red") +  # Box plot with outliers highlighted in red
  labs(
    title = "Positivity Score Distribution by Star Rating",
    x = "Star Rating",
    y = "Positivity Score",
    fill = "Car Model"
  ) +
  theme_minimal() +
  theme(
    legend.position = "top",  # Position the legend on top
    axis.text.x = element_text(angle = 45, hjust = 1)  # Rotate x-axis labels for readability
  )
```

### Analysis

In general, the higher the star rating a reviewer gave, the more positive the language they used in their review. This suggests that the results of the text analysis are generally reliable. However, the positivity ranges for adjacent star ratings overlap, so a formal test such as ANOVA is needed to understand the statistical relationship between these two variables.

## Question 4: What are the top words used to describe each car? Do they differ between the two cars?
```{r}
top_words <- tidy_reviews %>%
  count(car_model, word, sort = TRUE) %>%  # Count the frequency of each word per car model
  group_by(car_model) %>%
  top_n(10, n) %>%  # Get the top 10 most frequent words for each car model
  ungroup()

ggplot(top_words, aes(x = reorder(word, n), y = n, fill = car_model)) +
  geom_bar(stat = "identity") +
  coord_flip() +  # Flip the coordinates to make it horizontal
  facet_wrap(~ car_model, scales = "free_y") +  # Separate the plots by car model (Lexus and Acura)
  labs(
    title = "Top 10 Most Frequently Used Words in Lexus and Acura Reviews",
    x = "Word",
    y = "Frequency",
    fill = "Car Model"
  ) +
  geom_text(aes(label = n), hjust = -0.2, color = "black", size = 3) +  # Add labels to the bars
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  # Rotate x-axis labels for readability
    legend.position = "top"
  )
```

### Analysis

Note: the word "5" for the Acura was used 391 times (not 3), but the label was cut off in the chart.

The top 10 words used to describe both cars were extremely similar, with "5", "stars", and "car" being the top 3 words used for both. The words common to both lists highlight the general qualities of a car that reviewers value, such as technology, comfort, and reliability.