SUV Analysis

Author

Emma Black

Introduction

I am analyzing the most popular SUVs on the market (according to car rating website Edmunds.com) with the intention of finding the best one to purchase after graduation. I will be examining variables such as price, MPG, value, technology, owner ratings, and expert ratings to gain a comprehensive understanding of each vehicle.

Data Dictionary

Click here to access the specific data set I obtained for this analysis.

Variable Name	Description
car_name	year, make, and model of a car
car_price	an average of the given range of MSRP values, reflecting different trims
cost_to_drive	the estimated monthly fuel cost, assuming the car is driven primarily in Ohio for 15,000 miles per year, with 55% of those miles city and 45% highway
owner_stars	the average star rating out of 5 given in reviews of real owners
num_owner_reviews	the number of owner reviews posted for the car
total_rating	the rating out of 10 that the experts at Edmunds assessed for the overall vehicle
mpg	miles per gallon
tech_rate	the rating out of 10 that experts at Edmunds assessed for the vehicle’s technology features
interior_rate	the rating out of 10 that experts at Edmunds assessed for the vehicle’s interior quality
value_rate	the rating out of 10 that experts at Edmunds assigned based on the quality and amount of features for the given vehicle price
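
The `car_price` definition above averages the low and high ends of a listed MSRP range. The following is a minimal base-R sketch of that arithmetic, assuming the raw price arrives as a string such as "$40,000 - $50,000". That string is a made-up example, not a row from the data set, and `average_msrp` is a hypothetical helper name.

```r
# Hypothetical helper: average the endpoints of an MSRP range string.
average_msrp <- function(price_text) {
  # Drop dollar signs, thousands separators, and spaces
  cleaned <- gsub("[$, ]", "", price_text)
  # Split on the range dash; a single price yields just one piece
  parts <- as.numeric(strsplit(cleaned, "-", fixed = TRUE)[[1]])
  # Mean of one or two endpoints, rounded to whole dollars
  round(mean(parts))
}

range_price  <- average_msrp("$40,000 - $50,000")  # made-up range for illustration
single_price <- average_msrp("$39,690")            # a single price passes through unchanged
print(c(range_price, single_price))
```

A car listed with only one trim price keeps that price, so the same helper covers both cases.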

Part 1

Summary Statistics

Transposed Summary Statistics
variable	cost_to_drive	owner_stars	num_reviews	mpg	car_price
avg	198.32558	3.9688889	37.40909	22.44186	77322.71
median	197.00000	4.0000000	37.50000	22.00000	61975.00
sd	60.06137	0.5107135	22.05313	4.62036	48551.63
min	116.00000	2.9000000	2.00000	13.00000	26200.00
max	368.00000	5.0000000	98.00000	30.00000	224800.00

Which brand has the most cars on the list?

Because the list consists of the top 3 SUVs in each sub-category (such as Small 3 Row and Midsize Luxury), the brand with the most cars on the list is likely a brand that consistently produces quality vehicles.

Analysis

Mercedes is the clear front runner with 7 cars mentioned on the list, compared to the next highest of Audi at 4 cars. I found it more useful to compare the number of cars from each brand that made the list rather than the mean or median rating because all of the cars on the list are considered the best in their respective sub-categories. Therefore, there is not much difference in the mean and median values of their total rating by Edmunds.
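
The brand count can be sketched in base R as follows. This version uses only the six car names that appear elsewhere in this report, so it does not reproduce the Mercedes and Audi totals. It also assumes the brand is the first word after the model year, which is a simplification that fails for two-word brands such as Land Rover.

```r
# Car names that appear elsewhere in this report (a subset, not the full list)
car_name <- c("2025 Genesis GV70", "2024 Hyundai Palisade", "2025 Kia Telluride",
              "2025 Kia Sorento", "2024 Acura RDX", "2025 Lexus NX")

# Assumption: the brand is the first word after the four-digit model year
brand <- sub("^[0-9]{4} ([^ ]+).*$", "\\1", car_name)

# Count cars per brand, most frequent first
brand_counts <- sort(table(brand), decreasing = TRUE)
print(brand_counts)
```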

Is the overall rating from Edmunds experts aligned with the owner ratings?

While the experts at Edmunds likely have a lot of technical knowledge of what makes a “good” car, I myself am not a car enthusiast and likely don’t prioritize all the same features in a car that experts do. I feel that the opinions of everyday people who drive the cars regularly would more accurately predict how I might rate a car.

Analysis

At an aggregate level, the medians of the owner ratings and the expert ratings are virtually the same (8 vs 8.1), but the owner ratings vary far more. This makes sense, as everyday consumers are likely to differ more in their standards and preferences than experts do. Additionally, there are more total owner reviews than expert reviews, which gives owner reviews more opportunity for variation but also, as the law of large numbers suggests, makes it more likely that their median reflects the true median.
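
To put the two ratings on the same scale, the 5-star owner rating is doubled before the comparison. The sketch below uses only the two rows shown later in this report (Acura RDX and Lexus NX), so its medians and spreads are illustrative and will not match the full-list values quoted above.

```r
# Two rows taken from the ratings table shown later in this report
ratings <- data.frame(
  car_name     = c("2024 Acura RDX", "2025 Lexus NX"),
  owner_stars  = c(4.1, 4.2),  # owner rating, out of 5
  total_rating = c(7.9, 7.8)   # Edmunds expert rating, out of 10
)

# Rescale owner stars to a 10-point scale
ratings$owner_stars_2x <- ratings$owner_stars * 2

# Median and interquartile range for each rating type
spread <- sapply(ratings[c("owner_stars_2x", "total_rating")],
                 function(x) c(median = median(x), iqr = IQR(x)))
print(spread)
```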

Which car has the best value and how much does it cost?

Cars with the Highest Value Rate
car_name	value_rate	car_price
2025 Genesis GV70	8.5	52000
2024 Hyundai Palisade	8.5	45250
2025 Kia Telluride	8.5	44788
2025 Kia Sorento	8.5	39690

Median Car Price and Value Rate
median_car_price	median_value_rate
61975	7.5

Analysis

The four cars tied for the highest value rating all score one full point above the median (8.5 vs 7.5) and are priced well below the median price of $61,975. It’s also worth noting that two of the four are Kias, suggesting that Kia might be a more budget-friendly alternative to Mercedes, which has the most total cars on the list.
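
The comparison against the medians can be checked with a short base-R sketch. It uses only the six cars whose value ratings and prices appear in this report, and it takes the two medians as the reported constants rather than recomputing them from the full list.

```r
# Six cars whose value_rate and car_price appear in this report
suvs <- data.frame(
  car_name   = c("2025 Genesis GV70", "2024 Hyundai Palisade", "2025 Kia Telluride",
                 "2025 Kia Sorento", "2024 Acura RDX", "2025 Lexus NX"),
  value_rate = c(8.5, 8.5, 8.5, 8.5, 8, 8),
  car_price  = c(52000, 45250, 44788, 39690, 49250, 51572)
)

# Medians as reported for the full list (not recomputed here)
median_price <- 61975
median_value <- 7.5

# Keep every car tied for the highest value rating
best <- suvs[suvs$value_rate == max(suvs$value_rate), ]
best$points_above_median <- best$value_rate - median_value
best$below_median_price  <- best$car_price < median_price

# Cheapest first
print(best[order(best$car_price), ])
```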

Is there a correlation between MPG and price?

Analysis

Yes, there is a negative correlation between price and MPG. This is likely due to the fact that performance vehicles (which tend to be more expensive) often prioritize power over fuel efficiency.
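
The direction and strength of this relationship can be quantified with a correlation coefficient. The sketch below shows the calls involved; the `mpg` and `car_price` vectors are made-up values for illustration only and are not taken from the Edmunds data.

```r
# Made-up values for illustration only (NOT the Edmunds data)
mpg       <- c(15, 18, 21, 24, 27, 30)
car_price <- c(120000, 95000, 70000, 60000, 45000, 35000)

# Pearson correlation measures the linear association
pearson_r <- cor(mpg, car_price)

# Spearman correlation uses ranks, so it is less sensitive to a few very expensive cars
spearman_rho <- cor(mpg, car_price, method = "spearman")

# cor.test() adds a p-value and confidence interval for the Pearson estimate
pearson_test <- cor.test(mpg, car_price)

print(c(pearson = pearson_r, spearman = spearman_rho))
print(pearson_test$p.value)
```

With real data, missing MPG values would need to be dropped first, for example with `use = "complete.obs"` in `cor()`.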

Is there a correlation between tech rating and price?

Analysis

Yes, it appears that the better the tech is in a car, the higher the price. However, it is worth noting that cars with a tech rating of 9 have a wide range of prices, meaning that it is possible to get a car with high-quality tech without breaking the bank.

Part 2

Two very comparable SUV brands in my price range are Lexus and Acura, so I will be comparing owner reviews sourced from Edmunds.com of two of their most popular SUV models: the Acura RDX and the Lexus NX.

Note: I will refer to a positivity value throughout the analysis. This is a calculated field based on the overall positivity or negativity of the words used to write a review. A negative positivity value indicates a negative review.
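
As a rough illustration of how such a positivity value works, the sketch below scores a review as the number of positive words minus the number of negative words. The five-word lexicon, the sample sentences, and the `positivity_score` helper are all hypothetical; the analysis itself relies on the much larger NRC lexicon.

```r
# Tiny hand-made lexicon (hypothetical; a real sentiment lexicon is far larger)
lexicon <- data.frame(
  word      = c("love", "comfortable", "reliable", "noisy", "disappointing"),
  sentiment = c("positive", "positive", "positive", "negative", "negative")
)

# Hypothetical helper: positive word count minus negative word count
positivity_score <- function(review, lexicon) {
  # Lowercase, then split on anything that is not a letter or apostrophe
  words <- strsplit(tolower(review), "[^a-z']+")[[1]]
  # Look up each word; words missing from the lexicon become NA
  matched <- lexicon$sentiment[match(words, lexicon$word)]
  sum(matched == "positive", na.rm = TRUE) - sum(matched == "negative", na.rm = TRUE)
}

mixed_review    <- positivity_score("Love the comfortable seats, but the cabin is noisy.", lexicon)
negative_review <- positivity_score("Disappointing and noisy.", lexicon)
print(c(mixed_review, negative_review))
```

Because the score is a raw count difference, longer reviews can reach more extreme values simply by containing more scoreable words.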

Click here to access the exact data set I used.

Total Ratings for Lexus NX and Acura RDX
car_name	car_price	cost_to_drive	owner_stars	num_owner_reviews	total_rating	mpg	tech_rate	interior_rate	value_rate	brand	owner_stars_2x
2024 Acura RDX	49250	206	4.1	39	7.9	23	8.5	7.5	8	Acura	8.2
2025 Lexus NX	51572	121	4.2	31	7.8	28	8.0	7.5	8	Lexus	8.4

Exploration: What are the top words used to describe each car? Do they differ between the two cars?

Analysis

Note: the word “5” for the Acura was used 391 times (not 3); the label was cut off in the chart.

The top 10 words used to describe both cars were extremely similar, with “5”, “stars”, and “car” being the top 3 words for both. These common words highlight the general qualities of a car that reviewers value, such as technology, comfort, and reliability.

Is there a correlation between the month and the number of reviews published?

My hypothesis is that there would be more reviews around December and January, as these are generally the most popular times to purchase a car and it seems likely that people would review the car while it is relatively new to them.

Analysis

The typical December release of new model years, coupled with Christmas and promotions (such as the Lexus December to Remember Sales Event), makes December one of the most popular months to purchase a car. It is reasonable to infer that people are more likely to review cars they have recently bought, which would cause a spike in reviews around December, January, and February. While this pattern generally holds true, Lexus has its most significant spike in reviews in December, while Acura’s largest spike happens in May. This suggests either that variables other than how recently the car was bought affect when a consumer chooses to post a review, or that Acura has particularly effective sales events in “off” months that Lexus does not have.
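
One way to test the seasonal hypothesis directly is to pool reviews by calendar month across all years. The sketch below shows that grouping in base R; the `review_dates` values are made up for illustration and are not taken from the review data.

```r
# Made-up review dates for illustration only (NOT the Edmunds reviews)
review_dates <- as.Date(c("2023-12-04", "2023-12-19", "2024-01-08",
                          "2024-05-14", "2024-05-30", "2024-12-02"))

# Calendar month number (1-12), ignoring the year
month_num <- as.integer(format(review_dates, "%m"))

# Count reviews per calendar month; the factor levels keep empty months as zero
month_counts <- table(factor(month.abb[month_num], levels = month.abb))
print(month_counts)
```

Pooling by calendar month puts two different Decembers in the same bucket, which a year-by-month chart would show as separate bars.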

Is one car reviewed more positively than the other?

I examined the distribution of the overall positivity scores of each car to understand where the majority of reviews fell and how outliers impacted the scores.

# A tibble: 2 × 2
  car_model median_positivity
  <chr>                 <dbl>
1 acura                   5  
2 lexus                   5.5

Analysis

Both cars have very similar median positivity values: 5 for the Acura and 5.5 for the Lexus. Outliers can likely be attributed to excessively long reviews, which would have a higher number of scoreable words. The similarity in positivity is unsurprising, as both cars have very similar overall owner ratings on the Edmunds website, with the Acura at 4.1/5 stars and the Lexus at 4.2/5 stars.

Is the positivity value an accurate reflection of the reviewer’s feelings about the car?

To answer this question, I compared the star value that the reviewer assigned the car with the calculated positivity value to understand if a higher star value correlates to a higher positivity score.

Analysis

There appears to be a general trend that the higher the star rating a reviewer gave, the more positive the language they used in the review. This suggests that the results of the text analysis are generally reliable. However, the boxes for adjacent star ratings overlap, so a formal test such as ANOVA is needed to understand the statistical relationship between these two variables.
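
A minimal sketch of that follow-up test is shown here. The `reviews` data frame is made up for illustration and is not the Edmunds review data. A one-way ANOVA treats the star rating as a grouping factor, and the Kruskal-Wallis test is included as a rank-based alternative, since positivity scores may not be normally distributed.

```r
# Made-up reviews for illustration only (NOT the Edmunds review data)
reviews <- data.frame(
  stars      = rep(c(1, 3, 5), each = 4),
  positivity = c(-3, -1, 0, -2,   1, 2, 0, 3,   5, 7, 6, 9)
)

# One-way ANOVA: does mean positivity differ across star ratings?
fit <- aov(positivity ~ factor(stars), data = reviews)
print(summary(fit))

# Pull the p-value for the star-rating term out of the ANOVA table
anova_p <- summary(fit)[[1]][["Pr(>F)"]][1]

# Rank-based alternative that does not assume normality
kw <- kruskal.test(positivity ~ factor(stars), data = reviews)
print(c(anova_p = anova_p, kruskal_p = kw$p.value))
```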

---
title: "SUV Analysis"
author: "Emma Black"
editor: visual
toc: true # Generates an automatic table of contents.
format: # Options related to formatting.
  html: # Options related to HTML output.
    code-tools: TRUE # Allow the code tools option showing in the output.
    embed-resources: TRUE # Embeds all components into a single HTML file.
execute: # Options related to the execution of code chunks.
  warning: FALSE # FALSE: Code chunk warnings are hidden by default.
  message: FALSE # FALSE: Code chunk messages are hidden by default.
  echo: FALSE # FALSE: Code is hidden in the output by default.
---

## Introduction

I am analyzing the most popular SUVs on the market (according to car rating website [Edmunds.com](https://www.edmunds.com/)) with the intention of finding the best one to purchase after graduation. I will be examining variables such as price, MPG, value, technology, owner ratings, and expert ratings to gain a comprehensive understanding of each vehicle.

## Data Dictionary

[Click here](https://myxavier-my.sharepoint.com/:x:/g/personal/blacke6_xavier_edu/Eb16DCSE_dhEgA12H1R6BkYB24IpNnfeLL1M284T286NPQ?download=1) to access the specific data set I obtained for this analysis.

| Variable Name | Description |
|-------------------|-------------------------------------------------------------|
| car_name | year, make, and model of a car |
| car_price | an average of the given range of MSRP values, reflecting different trims |
| cost_to_drive | the estimated monthly fuel cost, assuming the car is driven primarily in Ohio for 15,000 miles per year, with 55% of those miles city and 45% highway |
| owner_stars | the average star rating out of 5 given in reviews of real owners |
| num_owner_reviews | the number of owner reviews posted for the car |
| total_rating | the rating out of 10 that the experts at Edmunds assessed for the overall vehicle |
| mpg | miles per gallon |
| tech_rate | the rating out of 10 that experts at Edmunds assessed for the vehicle's technology features |
| interior_rate | the rating out of 10 that experts at Edmunds assessed for the vehicle's interior quality |
| value_rate | the rating out of 10 that experts at Edmunds assigned based on the quality and amount of features for the given vehicle price |

```{r}
#| label: load libraries
#| include: FALSE
#| message: false

library(tidyverse) # The tidyverse collection of packages
library(httr)      # Useful for web authentication
library(rvest)     # Useful tools for working with HTML and XML
library(lubridate) # Working with dates
library(magrittr)  # Extracting items from list objects using piping grammar
library(chromote)  # Allows for live view of web pages
library(ggplot2)
library(tidyr)
library(jsonlite)  # Converting json data into data frames
library(tidytext)  # Tidy text mining
library(textdata)  # Lexicons of sentiment data
library(widyr)     # Easily calculating pairwise counts
library(stringr)
library(knitr)
```

```{r}
#| label: load the data
#| include: FALSE
#| message: false

all_suvs <-
  read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/blacke6_xavier_edu/EWMMLYx6aLhDoxjkHGoLrDsBCy6tduSMoodm9Qaos2MF8A?download=1")

# Clean the data
all_suvs <- all_suvs %>%
  # With only the top cars pulled from each sub-category, this column was not
  # helpful (the only values were 1, 2, and 3)
  select(-rank_in_sub_cat) %>%
  mutate(
    # Strip "$" and "/mo" so the monthly fuel cost is numeric
    cost_to_drive = cost_to_drive %>%
      str_replace_all("\\$", "") %>%
      str_replace_all("/mo", "") %>%
      str_trim() %>%
      as.numeric(),
    owner_stars = owner_stars %>%
      str_remove_all("out of 5 stars") %>%
      as.numeric(),
    num_owner_reviews = num_owner_reviews %>%
      str_remove_all("Owner Reviews") %>%
      as.numeric(),
    # Keep only digits and decimal points; empty strings become NA
    mpg = mpg %>%
      str_replace_all("[^0-9.]", "") %>%
      na_if("") %>%
      as.numeric(),
    # Average the low and high ends of an MSRP range; single prices pass through
    car_price = car_price %>%
      str_replace_all("\\$", "") %>%
      str_replace_all(",", "") %>%
      str_trim() %>%
      str_replace_all(" - ", "-") %>%
      map_chr(~ ifelse(str_detect(., "-"), as.character(mean(as.numeric(str_split(., "-")[[1]]))), .)) %>%
      as.numeric() %>%
      round()
  )
```

## Part 1

### Summary Statistics

```{r}
#| label: summary statistics

# Create a summary statistics data frame with variables as rows
summary_stats <- all_suvs %>%
  summarise(
    avg_cost_to_drive = mean(cost_to_drive, na.rm = TRUE),
    median_cost_to_drive = median(cost_to_drive, na.rm = TRUE),
    sd_cost_to_drive = sd(cost_to_drive, na.rm = TRUE),
    min_cost_to_drive = min(cost_to_drive, na.rm = TRUE),
    max_cost_to_drive = max(cost_to_drive, na.rm = TRUE),
    avg_owner_stars = mean(owner_stars, na.rm = TRUE),
    median_owner_stars = median(owner_stars, na.rm = TRUE),
    sd_owner_stars = sd(owner_stars, na.rm = TRUE),
    min_owner_stars = min(owner_stars, na.rm = TRUE),
    max_owner_stars = max(owner_stars, na.rm = TRUE),
    avg_num_reviews = mean(num_owner_reviews, na.rm = TRUE),
    median_num_reviews = median(num_owner_reviews, na.rm = TRUE),
    sd_num_reviews = sd(num_owner_reviews, na.rm = TRUE),
    min_num_reviews = min(num_owner_reviews, na.rm = TRUE),
    max_num_reviews = max(num_owner_reviews, na.rm = TRUE),
    avg_mpg = mean(mpg, na.rm = TRUE),
    median_mpg =
      median(mpg, na.rm = TRUE),
    sd_mpg = sd(mpg, na.rm = TRUE),
    min_mpg = min(mpg, na.rm = TRUE),
    max_mpg = max(mpg, na.rm = TRUE),
    avg_car_price = mean(car_price, na.rm = TRUE),
    median_car_price = median(car_price, na.rm = TRUE),
    sd_car_price = sd(car_price, na.rm = TRUE),
    min_car_price = min(car_price, na.rm = TRUE),
    max_car_price = max(car_price, na.rm = TRUE)
  )

summary_stats_transposed <- summary_stats %>%
  pivot_longer(cols = everything(), names_to = c("variable", "statistic"), names_pattern = "^(.*?)_(.*)$") %>%
  pivot_wider(names_from = statistic, values_from = value)

# Display as a neat table using kable
kable(summary_stats_transposed, format = "html", caption = "Transposed Summary Statistics")
```

### Which brand has the most cars on the list?

Because the list consists of the top 3 SUVs in each sub-category (such as Small 3 Row and Midsize Luxury), the brand with the most cars on the list is likely a brand that consistently produces quality vehicles.

```{r}
#| label: most cars on list

all_suvs <- all_suvs %>%
  mutate(brand = case_when(
    str_detect(car_name, "Cadillac") ~ "Cadillac",
    str_detect(car_name, "BMW") ~ "BMW",
    str_detect(car_name, "Mercedes") ~ "Mercedes",
    str_detect(car_name, "Audi") ~ "Audi",
    str_detect(car_name, "Porsche") ~ "Porsche",
    str_detect(car_name, "Tesla") ~ "Tesla",
    str_detect(car_name, "Rover") ~ "Land Rover",
    str_detect(car_name, "Bentley") ~ "Bentley",
    str_detect(car_name, "Lincoln") ~ "Lincoln",
    str_detect(car_name, "Acura") ~ "Acura",
    str_detect(car_name, "Lexus") ~ "Lexus",
    str_detect(car_name, "Genesis") ~ "Genesis",
    str_detect(car_name, "Ford") ~ "Ford",
    str_detect(car_name, "Chevy") ~ "Chevy",
    str_detect(car_name, "GMC") ~ "GMC",
    str_detect(car_name, "Toyota") ~ "Toyota",
    str_detect(car_name, "Hyundai") ~ "Hyundai",
    str_detect(car_name, "Kia") ~ "Kia",
    str_detect(car_name, "Jeep") ~ "Jeep",
    str_detect(car_name, "Honda") ~ "Honda",
    str_detect(car_name, "Mazda") ~ "Mazda",
    str_detect(car_name, "Buick") ~ "Buick"
  ))

all_suvs %>%
  count(brand, name = "count") %>% # Count occurrences of each brand
  ggplot(aes(x = reorder(brand, -count), y = count)) + # Order brands by count
  geom_bar(stat = "identity", fill = "steelblue") + # Create bar chart
  labs(
    title = "Number of Cars by Brand",
    x = "Brand",
    y = "Number of Cars"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels
```

#### Analysis

Mercedes is the clear front runner with 7 cars mentioned on the list, compared to the next highest of Audi at 4 cars. I found it more useful to compare the number of cars from each brand that made the list rather than the mean or median rating because all of the cars on the list are considered the best in their respective sub-categories. Therefore, there is not much difference in the mean and median values of their total rating by Edmunds.

### Is the overall rating from Edmunds experts aligned with the owner ratings?

While the experts at Edmunds likely have a lot of technical knowledge of what makes a "good" car, I myself am not a car enthusiast and likely don't prioritize all the same features in a car that experts do. I feel that the opinions of everyday people who drive the cars regularly would more accurately predict how I might rate a car.

```{r}
#| label: owner vs edmunds rating

# Rescale the 5-star owner rating to the experts' 10-point scale
all_suvs <- all_suvs %>%
  mutate(owner_stars_2x = owner_stars * 2)

# Create the box plot
all_suvs %>%
  select(owner_stars_2x, total_rating) %>%
  pivot_longer(cols = everything(), names_to = "metric", values_to = "value") %>%
  ggplot(aes(x = metric, y = value, fill = metric)) +
  geom_boxplot() +
  # Label each box with its median
  stat_summary(fun = median, geom = "text", aes(label = round(after_stat(y), 1)), color = "black", vjust = -0.5, size = 3.5) +
  labs(
    title = "Owner vs Expert Rating",
    x = "Review Type",
    y = "Value"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
```

#### Analysis

At an aggregate level, the medians of the owner ratings and the expert ratings are virtually the same (8 vs 8.1), but the owner ratings vary far more. This makes sense, as everyday consumers are likely to differ more in their standards and preferences than experts do. Additionally, there are more total owner reviews than expert reviews, which gives owner reviews more opportunity for variation but also, as the law of large numbers suggests, makes it more likely that their median reflects the true median.

### Which car has the best value and how much does it cost?

```{r}
#| label: best value

# Find the highest value rate
highest_value_rate <- max(all_suvs$value_rate, na.rm = TRUE)

# Filter the data for cars with the highest value rate
highest_value_cars <- all_suvs %>%
  filter(value_rate == highest_value_rate) %>%
  select(car_name, value_rate, car_price)

# Display the highest value cars as a neat table using kable
kable(highest_value_cars, format = "html", caption = "Cars with the Highest Value Rate")
```

```{r}
# Calculate median car price and value rate
median_values <- data.frame(
  median_car_price = median(all_suvs$car_price, na.rm = TRUE),
  median_value_rate = median(all_suvs$value_rate, na.rm = TRUE)
)

# Display the median values as a neat table using kable
kable(median_values, format = "html", caption = "Median Car Price and Value Rate")
```

#### Analysis

The four cars tied for the highest value rating all score one full point above the median (8.5 vs 7.5) and are priced well below the median price of $61,975. It's also worth noting that two of the four are Kias, suggesting that Kia might be a more budget-friendly alternative to Mercedes, which has the most total cars on the list.

### Is there a correlation between MPG and price?

```{r}
#| label: mpg vs price

# Create a scatter plot comparing mpg and car price
all_suvs %>%
  ggplot(aes(x = mpg, y = car_price)) +
  geom_point(color = "steelblue") +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(
    title = "Scatter Plot of MPG vs Car Price",
    x = "Miles Per Gallon (MPG)",
    y = "Car Price"
  ) +
  theme_minimal()
```

#### Analysis

Yes, there is a negative correlation between price and MPG. This is likely due to the fact that performance vehicles (which tend to be more expensive) often prioritize power over fuel efficiency.

### Is there a correlation between tech rating and price?

```{r}
# Create a box plot with tech_rate on the x-axis and car_price on the y-axis, grouped by tech_rate
all_suvs %>%
  ggplot(aes(x = factor(tech_rate), y = car_price, fill = factor(tech_rate))) +
  geom_boxplot() +
  labs(
    title = "Box Plot of Car Price by Tech Rate",
    x = "Tech Rate",
    y = "Car Price"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for readability
```

#### Analysis

Yes, it appears that the better the tech is in a car, the higher the price. However, it is worth noting that cars with a tech rating of 9 have a wide range of prices, meaning that it is possible to get a car with high-quality tech without breaking the bank.

## Part 2

Two very comparable SUV brands in my price range are Lexus and Acura, so I will be comparing owner reviews sourced from Edmunds.com of two of their most popular SUV models: the Acura RDX and the Lexus NX.

Note: I will refer to a positivity value throughout the analysis. This is a calculated field based on the overall positivity or negativity of the words used to write a review. A negative positivity value indicates a negative review.

[Click here](https://myxavier-my.sharepoint.com/:x:/g/personal/blacke6_xavier_edu/Ef-R7Rp2nRVCvd1Ae8cXPgwB7zD6WlmtXA5w1QnWXFLdXw?download=1) to access the exact data set I used.

```{r}
#| label: load csv

lex_acura_reviews <- read.csv("https://myxavier-my.sharepoint.com/:x:/g/personal/blacke6_xavier_edu/Ef-R7Rp2nRVCvd1Ae8cXPgwB7zD6WlmtXA5w1QnWXFLdXw?download=1")
```

```{r}
#| label: prepare the data

# Tokenize each review by word and drop common stop words
tidy_reviews <- lex_acura_reviews %>%
  unnest_tokens(word, review) %>%
  anti_join(stop_words)

# Join the review data to the NRC lexicon
nrc <- get_sentiments("nrc")

# Count sentiment words per reviewer; positivity = positive count - negative count
suv_sentiments <- tidy_reviews %>%
  inner_join(nrc, by = "word", relationship = "many-to-many") %>%
  group_by(sentiment, user_name) %>%
  summarize(n = n()) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(positivity = positive-negative)

# Group sentiment analysis at the review level
reviews_w_sentiments <- lex_acura_reviews %>%
  inner_join(suv_sentiments, by = "user_name") %>%
  mutate(date = ymd(date))
```

```{r}
# Filter rows using str_detect
total_rating_df <- all_suvs %>%
  filter(str_detect(car_name, "Lexus NX|Acura RDX"))

# Print as a neat table
kable(total_rating_df, format = "html", caption = "Total Ratings for Lexus NX and Acura RDX")
```

### Exploration: What are the top words used to describe each car? Do they differ between the two cars?

```{r}
top_words <- tidy_reviews %>%
  count(car_model, word, sort = TRUE) %>% # Count the frequency of each word per car model
  group_by(car_model) %>%
  top_n(10, n) %>% # Get the top 10 most frequent words for each car model
  ungroup()

ggplot(top_words, aes(x = reorder(word, n), y = n, fill = car_model)) +
  geom_bar(stat = "identity") +
  coord_flip() + # Flip the coordinates to make it horizontal
  facet_wrap(~ car_model, scales = "free_y") + # Separate the plots by car model (Lexus and Acura)
  labs(
    title = "Top 10 Most Frequently Used Words in Lexus and Acura Reviews",
    x = "Word",
    y = "Frequency",
    fill = "Car Model"
  ) +
  geom_text(aes(label = n), hjust = -0.2, color = "black", size = 3) + # Add labels to the bars
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1), # Rotate x-axis labels for readability
    legend.position = "top"
  )
```

#### Analysis

Note: the word "5" for the Acura was used 391 times (not 3); the label was cut off in the chart.

The top 10 words used to describe both cars were extremely similar, with "5", "stars", and "car" being the top 3 words for both. These common words highlight the general qualities of a car that reviewers value, such as technology, comfort, and reliability.

### Is there a correlation between the month and the number of reviews published?

My hypothesis is that there would be more reviews around December and January, as these are generally the most popular times to purchase a car and it seems likely that people would review the car while it is relatively new to them.

```{r}
#| label: question 1

# Add a column for the month
reviews_w_sentiments <- reviews_w_sentiments %>%
  mutate(month = floor_date(date, unit = "month")) # Round each date down to the first day of its month

# Group by month and car_model, then count the number of reviews
monthly_reviews_count <- reviews_w_sentiments %>%
  group_by(month, car_model) %>%
  summarize(review_count = n(), .groups = "drop")

# Create the bar graph
ggplot(monthly_reviews_count, aes(x = month, y = review_count, fill = car_model)) +
  geom_bar(stat = "identity", position = "dodge") + # Use dodge to separate bars by car_model
  labs(
    title = "Number of Reviews by Month and Car Model",
    x = "Month",
    y = "Number of Reviews",
    fill = "Car Model"
  ) +
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") + # Format x-axis for months
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for readability
```

#### Analysis

The typical December release of new model years, coupled with Christmas and promotions (such as the Lexus December to Remember Sales Event), makes December one of the most popular months to purchase a car. It is reasonable to infer that people are more likely to review cars they have recently bought, which would cause a spike in reviews around December, January, and February. While this pattern generally holds true, Lexus has its most significant spike in reviews in December, while Acura's largest spike happens in May. This suggests either that variables other than how recently the car was bought affect when a consumer chooses to post a review, or that Acura has particularly effective sales events in "off" months that Lexus does not have.

### Is one car reviewed more positively than the other?

I examined the distribution of the overall positivity scores of each car to understand where the majority of reviews fell and how outliers impacted the scores.

```{r}
ggplot(reviews_w_sentiments, aes(x = positivity, fill = car_model)) +
  geom_histogram(binwidth = 1, position = "dodge", alpha = 0.7, color = "black") +
  facet_wrap(~ car_model) + # Facet by car_model to create separate plots
  geom_vline(
    data = reviews_w_sentiments %>%
      group_by(car_model) %>%
      summarize(median_positivity = median(positivity, na.rm = TRUE)),
    aes(xintercept = median_positivity, color = car_model),
    linetype = "dashed", size = 1) + # Add dashed lines for median
  labs(
    title = "Distribution of Positivity Scores for Lexus and Acura",
    x = "Positivity Score",
    y = "Count",
    fill = "Car Model",
    color = "Car Model"
  ) +
  theme_minimal() +
  theme(
    legend.position = "top", # Position the legend on top
    axis.text.x = element_text(angle = 45, hjust = 1) # Rotate x-axis labels for better readability
  )
```

```{r}
median_positivity_by_model <- reviews_w_sentiments %>%
  group_by(car_model) %>%
  summarize(median_positivity = median(positivity, na.rm = TRUE))

# Print the result
print(median_positivity_by_model)
```

#### Analysis

Both cars have very similar median positivity values: 5 for the Acura and 5.5 for the Lexus. Outliers can likely be attributed to excessively long reviews, which would have a higher number of scoreable words. The similarity in positivity is unsurprising, as both cars have very similar overall owner ratings on the Edmunds website, with the Acura at 4.1/5 stars and the Lexus at 4.2/5 stars.

### Is the positivity value an accurate reflection of the reviewer's feelings about the car?

To answer this question, I compared the star value that the reviewer assigned the car with the calculated positivity value to understand if a higher star value correlates to a higher positivity score.

```{r}
ggplot(reviews_w_sentiments, aes(x = as.factor(stars), y = positivity, fill = car_model)) +
  geom_boxplot(alpha = 0.7, outlier.shape = 16, outlier.colour = "red") + # Box plot with outliers highlighted in red
  labs(
    title = "Positivity Score Distribution by Star Rating",
    x = "Star Rating",
    y = "Positivity Score",
    fill = "Car Model"
  ) +
  theme_minimal() +
  theme(
    legend.position = "top", # Position the legend on top
    axis.text.x = element_text(angle = 45, hjust = 1) # Rotate x-axis labels for readability
  )
```

#### Analysis

There appears to be a general trend that the higher the star rating a reviewer gave, the more positive the language they used in the review. This suggests that the results of the text analysis are generally reliable. However, the boxes for adjacent star ratings overlap, so a formal test such as ANOVA is needed to understand the statistical relationship between these two variables.