I am analyzing the most popular luxury SUVs on the market (according to the car rating website Edmunds.com) to find the best one to purchase after graduation. Two of the most reliable and highly rated SUV brands in my price range are Lexus and Acura, so I will compare reviews of two of their most popular entry-level SUV models: the Acura RDX and the Lexus NX.
Note: I will refer to a positivity value throughout the analysis. This is a calculated field based on the overall positivity or negativity of the words used in a review: the number of positive words minus the number of negative words, as classified by the NRC sentiment lexicon. A negative positivity value indicates a negative review.
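As a minimal sketch of how such a score can be computed, the base R snippet below counts the positive and negative words in a review and subtracts one count from the other. The tiny `toy_lexicon` and the helper `score_positivity()` are hypothetical stand-ins for illustration only; the analysis itself relies on the NRC lexicon loaded through tidytext.

```r
# Hypothetical mini-lexicon for illustration only (not the NRC lexicon).
toy_lexicon <- data.frame(
  word = c("love", "comfortable", "reliable", "noisy", "cramped"),
  sentiment = c("positive", "positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

# Score one review: (number of positive words) - (number of negative words).
score_positivity <- function(review, lexicon = toy_lexicon) {
  # Lower-case the text and split on anything that is not a letter or apostrophe.
  words <- strsplit(tolower(review), "[^a-z']+")[[1]]
  words <- words[nzchar(words)]
  # Look up each word's sentiment; unmatched words become NA and are ignored.
  hits <- lexicon$sentiment[match(words, lexicon$word)]
  sum(hits == "positive", na.rm = TRUE) - sum(hits == "negative", na.rm = TRUE)
}

score_positivity("I love how comfortable and reliable it is")  # 3
score_positivity("The cabin is noisy and cramped")             # -2
```

Because the score is a raw difference of counts, it has no fixed upper or lower bound and grows with the number of scoreable words in a review.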
Question 1: Is there a correlation between the month and the number of reviews published?
My hypothesis is that there will be more reviews around December and January, as these are generally the most popular times to purchase a car and people seem likely to review a car while it is relatively new to them.
Analysis
New model years are typically released in December. Coupled with Christmas and promotions (such as the Lexus December to Remember Sales Event), this makes December one of the most popular months to purchase a car. It is reasonable to infer that people are more likely to review cars they have recently bought, which would cause a spike in reviews around December, January, and February. While this pattern generally holds, Lexus has its most significant spike in reviews in December, while Acura’s largest spike happens in May. This suggests either that variables other than how recently the car was bought affect when a consumer chooses to write a review, or that Acura has particularly effective sales events in “off” months that Lexus does not have.
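One way to check the seasonal pattern more directly is to pool reviews by calendar month across all years, rather than by each individual month-year. The base R sketch below shows the idea on a handful of made-up records; the `reviews` data frame, its dates, and the model labels are hypothetical and are not taken from the Edmunds data.

```r
# Hypothetical review records for illustration; not the Edmunds data.
reviews <- data.frame(
  date = as.Date(c("2021-12-03", "2022-12-18", "2022-05-09",
                   "2023-05-21", "2023-01-02", "2022-12-30")),
  car_model = c("Lexus NX", "Lexus NX", "Acura RDX",
                "Acura RDX", "Lexus NX", "Acura RDX"),
  stringsAsFactors = FALSE
)

# Month of the year as an ordered factor (Jan..Dec), ignoring the year.
reviews$month_of_year <- factor(
  month.abb[as.integer(format(reviews$date, "%m"))],
  levels = month.abb
)

# Cross-tabulate the number of reviews per model and calendar month.
month_counts <- table(reviews$car_model, reviews$month_of_year)

# Show only the months discussed in the analysis.
month_counts[, c("Jan", "May", "Dec")]
```

Pooling by calendar month makes a recurring December or May peak easier to separate from a one-off spike in a single year.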
Question 2: Is one car reviewed more positively than the other?
I examined the distribution of the overall positivity scores of each car to understand where the majority of reviews fell and how outliers impacted the scores.
Both cars have very similar median positivity values: 5 for the Acura and 5.5 for the Lexus. Outliers can likely be attributed to unusually long reviews, which contain more scoreable words. The similarity in positivity ratings is unsurprising, as both cars have very similar overall ratings on the Edmunds website: 4.1/5 stars for the Acura and 4.2/5 stars for the Lexus.
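If long reviews do drive the outliers, one possible adjustment (not used in this report) is to divide the positivity value by the number of scored words, so that a review's length no longer inflates its score. The sketch below uses made-up counts; the `scores` data frame and the `positivity_rate` column are assumptions for illustration.

```r
# Hypothetical per-review word counts for illustration; not the real data.
scores <- data.frame(
  review_id = 1:4,
  positive = c(4, 20, 2, 1),
  negative = c(1, 5, 2, 3)
)

# Raw positivity, as defined in this report.
scores$positivity <- scores$positive - scores$negative

# Net share of scored words: stays within [-1, 1] regardless of review length.
scored <- scores$positive + scores$negative
scores$positivity_rate <- ifelse(scored > 0, scores$positivity / scored, NA_real_)

scores
```

In this toy data, reviews 1 and 2 have raw scores of 3 and 15 but the same rate of 0.6, which shows how a long review can look like an outlier on the raw scale while using equally positive language.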
Question 3: Is the positivity value an accurate reflection of the reviewer’s feelings about the car?
To answer this question, I compared the star value that the reviewer assigned the car with the calculated positivity value to understand if a higher star value correlates to a higher positivity score.
Analysis
There appears to be a general trend: the higher the star rating a reviewer gave, the more positive the language in their review. This suggests that the results of the text analysis are generally reliable. However, the positivity distributions for different star ratings overlap considerably, so a formal test such as ANOVA is needed to establish the statistical relationship between these two variables.
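A follow-up test of this kind might look like the sketch below, which runs a one-way ANOVA and a rank-based Kruskal-Wallis test in base R. The data are simulated purely for illustration (the `sim` data frame, the sample sizes, and the assumed effect size are all invented), so the output says nothing about the actual reviews.

```r
set.seed(42)  # reproducible simulated data, for illustration only

# Simulated stand-in for the review data: 5 star levels, 30 reviews each,
# with mean positivity rising by 1.5 per star (an assumption, not a finding).
sim <- data.frame(stars = rep(1:5, each = 30))
sim$positivity <- round(rnorm(nrow(sim), mean = 1.5 * sim$stars, sd = 4))

# One-way ANOVA: does mean positivity differ across star ratings?
fit <- aov(positivity ~ factor(stars), data = sim)
summary(fit)

# Rank-based alternative that is less sensitive to outliers.
kw <- kruskal.test(positivity ~ factor(stars), data = sim)
kw
```

With the real data, `sim` would be replaced by the review-level table containing `stars` and `positivity`. The Kruskal-Wallis test is included because the positivity scores contain outliers, which can distort an ANOVA.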
Question 4: What are the top words used to describe each car? Do they differ between the two cars?
Analysis
Note: the word “5” was used 391 times in the Acura reviews; the label on the chart was cut off and appears as “3”.
The top 10 words used to describe both cars were extremely similar, with “5”, “stars”, and “car” being the top 3 words used for both. These common words highlight the general qualities of a car that reviewers value, such as technology, comfort, and reliability.
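Since “5”, “stars”, and “car” say little about either vehicle, one option is to drop numeric tokens and a short custom stop list before ranking the words. The base R sketch below illustrates this on a made-up token vector; the `tokens` and `custom_stop` values are assumptions, not the actual review text or a list used in this report.

```r
# Hypothetical tokens for illustration; not the real review text.
tokens <- c("5", "stars", "car", "comfortable", "5", "technology",
            "car", "reliable", "comfortable", "stars", "5", "comfortable")

# Drop purely numeric tokens and a small custom stop list (an assumption).
custom_stop <- c("stars", "car")
keep <- !grepl("^[0-9]+$", tokens) & !(tokens %in% custom_stop)

# Count the remaining words, most frequent first.
word_counts <- sort(table(tokens[keep]), decreasing = TRUE)
word_counts
```

Filtering this way leaves room in a top 10 list for words that describe the cars themselves and that are more likely to differ between the two models.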
Source Code
---
title: "Assignment 7 Acura vs Lexus"
author: "Emma Black"
editor: visual
toc: true # Generates an automatic table of contents.
format: # Options related to formatting.
  html: # Options related to HTML output.
    code-tools: TRUE # Allow the code tools option showing in the output.
    embed-resources: TRUE # Embeds all components into a single HTML file.
execute: # Options related to the execution of code chunks.
  warning: FALSE # FALSE: Code chunk warnings are hidden by default.
  message: FALSE # FALSE: Code chunk messages are hidden by default.
  echo: FALSE # FALSE: Code is hidden in the output (TRUE would show all code).
---

## Introduction

I am analyzing the most popular luxury SUVs on the market (according to the car rating website Edmunds.com) to find the best one to purchase after graduation. Two of the most reliable and highly rated SUV brands in my price range are Lexus and Acura, so I will compare reviews of two of their most popular entry-level SUV models: the Acura RDX and the Lexus NX.

Note: I will refer to a positivity value throughout the analysis. This is a calculated field based on the overall positivity or negativity of the words used in a review: the number of positive words minus the number of negative words, as classified by the NRC sentiment lexicon. A negative positivity value indicates a negative review.

```{r}
#| label: load libraries
#| include: FALSE
#| message: false
library(tidyverse) # For all the tidy things!
library(lubridate) # Convenient transforming of date values.
library(knitr) # Useful tools when 'knitting' (rendering) Quarto documents.
library(skimr) # tool for quickly assessing data set
library(jsonlite) # Converting json data into data frames
library(magrittr) # Extracting items from list objects using piping grammar
library(httr) # Interacting with HTTP verbs
library(tidytext) # Tidy text mining
library(textdata) # Lexicons of sentiment data
library(widyr) # Easily calculating pairwise counts
```

```{r}
#| label: load csv
lex_acura_reviews <- read.csv("https://myxavier-my.sharepoint.com/:x:/g/personal/blacke6_xavier_edu/Ef-R7Rp2nRVCvd1Ae8cXPgwB7zD6WlmtXA5w1QnWXFLdXw?download=1")
```

```{r}
#| label: prepare the data
# tokenize each review by word and drop common stop words
tidy_reviews <- lex_acura_reviews %>%
  unnest_tokens(word, review) %>%
  anti_join(stop_words)

# join the review data to the NRC lexicon
nrc <- get_sentiments("nrc")

# count words per sentiment for each reviewer, then compute positivity
suv_sentiments <- tidy_reviews %>%
  inner_join(nrc, by = "word", relationship = "many-to-many") %>%
  group_by(sentiment, user_name) %>%
  summarize(n = n()) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(positivity = positive - negative)

# group sentiment analysis at the review level
reviews_w_sentiments <- lex_acura_reviews %>%
  inner_join(suv_sentiments, by = "user_name") %>%
  mutate(date = ymd(date))
```

## Question 1: Is there a correlation between the month and the number of reviews published?

My hypothesis is that there will be more reviews around December and January, as these are generally the most popular times to purchase a car and people seem likely to review a car while it is relatively new to them.

```{r}
#| label: question 1
# Add a column for the month
reviews_w_sentiments <- reviews_w_sentiments %>%
  mutate(month = floor_date(date, unit = "month")) # Round each date down to the first day of its month

# Group by month and car_model, then count the number of reviews
monthly_reviews_count <- reviews_w_sentiments %>%
  group_by(month, car_model) %>%
  summarize(review_count = n(), .groups = "drop")

# Create the bar graph
ggplot(monthly_reviews_count, aes(x = month, y = review_count, fill = car_model)) +
  geom_bar(stat = "identity", position = "dodge") + # Use dodge to separate bars by car_model
  labs(
    title = "Number of Reviews by Month and Car Model",
    x = "Month",
    y = "Number of Reviews",
    fill = "Car Model"
  ) +
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") + # Format x-axis for months
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for readability
```

### Analysis

New model years are typically released in December. Coupled with Christmas and promotions (such as the Lexus December to Remember Sales Event), this makes December one of the most popular months to purchase a car. It is reasonable to infer that people are more likely to review cars they have recently bought, which would cause a spike in reviews around December, January, and February. While this pattern generally holds, Lexus has its most significant spike in reviews in December, while Acura's largest spike happens in May. This suggests either that variables other than how recently the car was bought affect when a consumer chooses to write a review, or that Acura has particularly effective sales events in "off" months that Lexus does not have.

## Question 2: Is one car reviewed more positively than the other?

I examined the distribution of the overall positivity scores of each car to understand where the majority of reviews fell and how outliers impacted the scores.

```{r}
ggplot(reviews_w_sentiments, aes(x = positivity, fill = car_model)) +
  geom_histogram(binwidth = 1, position = "dodge", alpha = 0.7, color = "black") +
  facet_wrap(~ car_model) + # Facet by car_model to create separate plots
  geom_vline(
    data = reviews_w_sentiments %>%
      group_by(car_model) %>%
      summarize(median_positivity = median(positivity, na.rm = TRUE)),
    aes(xintercept = median_positivity, color = car_model),
    linetype = "dashed", size = 1
  ) + # Add dashed lines for median
  labs(
    title = "Distribution of Positivity Scores for Lexus and Acura",
    x = "Positivity Score",
    y = "Count",
    fill = "Car Model",
    color = "Car Model"
  ) +
  theme_minimal() +
  theme(
    legend.position = "top", # Position the legend on top
    axis.text.x = element_text(angle = 45, hjust = 1) # Rotate x-axis labels for better readability
  )
```

```{r}
median_positivity_by_model <- reviews_w_sentiments %>%
  group_by(car_model) %>%
  summarize(median_positivity = median(positivity, na.rm = TRUE))

# Print the result
print(median_positivity_by_model)
```

### Analysis

Both cars have very similar median positivity values: 5 for the Acura and 5.5 for the Lexus. Outliers can likely be attributed to unusually long reviews, which contain more scoreable words. The similarity in positivity ratings is unsurprising, as both cars have very similar overall ratings on the Edmunds website: 4.1/5 stars for the Acura and 4.2/5 stars for the Lexus.

## Question 3: Is the positivity value an accurate reflection of the reviewer's feelings about the car?

To answer this question, I compared the star value that the reviewer assigned the car with the calculated positivity value to understand if a higher star value correlates to a higher positivity score.

```{r}
ggplot(reviews_w_sentiments, aes(x = as.factor(stars), y = positivity, fill = car_model)) +
  geom_boxplot(alpha = 0.7, outlier.shape = 16, outlier.colour = "red") + # Box plot with outliers highlighted in red
  labs(
    title = "Positivity Score Distribution by Star Rating",
    x = "Star Rating",
    y = "Positivity Score",
    fill = "Car Model"
  ) +
  theme_minimal() +
  theme(
    legend.position = "top", # Position the legend on top
    axis.text.x = element_text(angle = 45, hjust = 1) # Rotate x-axis labels for readability
  )
```

### Analysis

There appears to be a general trend: the higher the star rating a reviewer gave, the more positive the language in their review. This suggests that the results of the text analysis are generally reliable. However, the positivity distributions for different star ratings overlap considerably, so a formal test such as ANOVA is needed to establish the statistical relationship between these two variables.

## Question 4: What are the top words used to describe each car? Do they differ between the two cars?

```{r}
top_words <- tidy_reviews %>%
  count(car_model, word, sort = TRUE) %>% # Count the frequency of each word per car model
  group_by(car_model) %>%
  top_n(10, n) %>% # Get the top 10 most frequent words for each car model
  ungroup() %>%
  view()

ggplot(top_words, aes(x = reorder(word, n), y = n, fill = car_model)) +
  geom_bar(stat = "identity") +
  coord_flip() + # Flip the coordinates to make it horizontal
  facet_wrap(~ car_model, scales = "free_y") + # Separate the plots by car model (Lexus and Acura)
  labs(
    title = "Top 10 Most Frequently Used Words in Lexus and Acura Reviews",
    x = "Word",
    y = "Frequency",
    fill = "Car Model"
  ) +
  geom_text(aes(label = n), hjust = -0.2, color = "black", size = 3) + # Add labels to the bars
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1), # Rotate x-axis labels for readability
    legend.position = "top"
  )
```

### Analysis

Note: the word "5" was used 391 times in the Acura reviews; the label on the chart was cut off and appears as "3".

The top 10 words used to describe both cars were extremely similar, with "5", "stars", and "car" being the top 3 words used for both. These common words highlight the general qualities of a car that reviewers value, such as technology, comfort, and reliability.