First, load the libraries and the dataset.

# Load necessary libraries
library(ggplot2)
library(dplyr)
library(tidyr)
library(purrr)

# Load the dataset
coffee <- read.csv("F:/CP8883/coffee.csv")

Next, I recreate each plot and give some insights for it.
Note that I’ve added titles and axis labels for readability; some of the plots in the instructions do not include them.

Plot 1.

ggplot(coffee, aes(x = as.factor(avg_rating), y = hhincome)) +
  geom_boxplot() +
  labs(title = "Average Rating vs Household Income",
       x = "Average Rating",
       y = "Household Income (Median Annual)") +
  theme_gray()

Findings: The boxplot shows that businesses rated 3 and 4 tend to be located in areas with higher median household income, with some outliers above $150,000, while businesses rated 1 and 2 tend to be in lower-income areas. This suggests that businesses in wealthier areas may receive higher Yelp ratings, though the spread within each rating category is large. The exception is the group with an average rating of 5: these businesses tend to be located in areas with lower median household income, which runs against the trend for the other ratings. Businesses in lower-income areas can therefore still achieve excellent customer satisfaction, and factors other than income are likely influencing these top ratings.
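The pattern read off the boxplot can also be checked numerically. Below is a minimal sketch, assuming `coffee` contains the `avg_rating` and `hhincome` columns used in the plot. The helper name `income_by_rating` and the toy data frame are hypothetical and only illustrate the call; the toy values are made up and are not taken from coffee.csv.

```r
library(dplyr)

# Hypothetical helper: per-rating count, median, and IQR of household income.
income_by_rating <- function(df) {
  df %>%
    group_by(avg_rating) %>%
    summarise(
      n = n(),
      median_income = median(hhincome, na.rm = TRUE),
      iqr_income = IQR(hhincome, na.rm = TRUE),
      .groups = "drop"
    )
}

# Made-up rows for illustration only (not from coffee.csv).
toy <- data.frame(
  avg_rating = c(1, 1, 3, 3, 3, 5),
  hhincome   = c(30000, 40000, 60000, 80000, 100000, 45000)
)
toy_summary <- income_by_rating(toy)
print(toy_summary)
```

Calling `income_by_rating(coffee)` on the real data would show how many businesses fall in each rating group, which matters when interpreting the lower median for rating 5.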

Plot 2.

ggplot(coffee, aes(x = as.factor(avg_rating), y = hhincome)) +
  geom_boxplot() +
  facet_wrap(~ county) +
  labs(title = "Average Rating vs Household Income by County",
       x = "Average Yelp Rating",
       y = "Median Annual Household Income ($)") +
  theme_gray()

Findings: The data show variability between counties, with Fulton and Cobb having a broader spread of income levels across all rating categories. DeKalb County shows income rising with rating, suggesting that businesses in wealthier neighborhoods tend to get higher ratings. Clayton and Gwinnett Counties show less variation, with most businesses clustering in lower or middle income brackets regardless of rating. The relationship between Yelp ratings and income therefore varies by county, and the panels also reflect each county's distinct income distribution. As in Plot 1, businesses rated 5 seem to cluster in lower-income areas.

Plot 3.

ggplot(coffee, aes(x = review_count_log, y = hhincome, color = pct_white)) +
  geom_point(alpha = 0.6, size = 3) +
  facet_wrap(~ county, ncol = 3) +  # Keep ncol = 3 for 3 columns
  scale_color_gradient(low = "blue", high = "red") +
  labs(title = "Scatterplot: Review Count vs. Household Income",
       x = "Review Count (log)",
       y = "Median Annual Household Income",
       color = "Proportion of residents \nwho self-identified as white") +
  theme_minimal() +
  theme(  # Style facet strips, panel borders, and major grid lines
    strip.background = element_rect(fill = "gray90", color = "black", size = 0.3),
    strip.text = element_text(size = 8),
    panel.border = element_rect(color = "black", fill = NA, size = 0.3),
    panel.grid.major = element_line(size = 0.5)
  )

Findings: The color gradient indicates that businesses in areas with a higher proportion of white residents tend to have more reviews, particularly in Fulton and Cobb counties. Household income and review count are weakly positively correlated in most counties, though the strength of the relationship varies; in DeKalb and Clayton counties, for example, lower-income areas tend to have fewer reviews. Together these patterns suggest that both demographic and economic factors influence customer engagement: businesses in predominantly white, wealthier areas tend to attract more reviews, pointing to a potential link between community demographics and online business visibility.
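The county-by-county comparison of correlation strength can be made explicit with a grouped summary. This is a rough sketch, assuming `coffee` has the `county`, `review_count_log`, and `hhincome` columns used in the plot. The helper name `cor_by_county` and the toy data are hypothetical; the toy values are made up purely to show the output shape.

```r
library(dplyr)

# Hypothetical helper: Pearson correlation between logged review count
# and household income, computed separately for each county.
cor_by_county <- function(df) {
  df %>%
    group_by(county) %>%
    summarise(
      n = n(),
      r = cor(review_count_log, hhincome, use = "complete.obs"),
      .groups = "drop"
    )
}

# Made-up rows for illustration only (not from coffee.csv).
toy <- data.frame(
  county = rep(c("A", "B"), each = 4),
  review_count_log = c(1, 2, 3, 4, 1, 2, 3, 4),
  hhincome = c(40000, 50000, 60000, 70000, 70000, 60000, 50000, 40000)
)
toy_cor <- cor_by_county(toy)
print(toy_cor)
```

On the real data, `cor_by_county(coffee)` would give one coefficient per county, so the "weak positive in most counties" reading can be compared against actual numbers rather than the scatterplot alone.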

Plot 4.

First, I perform some data and label preparation.

# Reshape the data using pivot_longer()
coffee_long <- coffee %>%
  pivot_longer(cols = c(hhincome, pct_pov_log, pct_white, pop),
               names_to = "variable", values_to = "value")

# Compute Pearson's R and p-value for each variable (pooled across all counties)
cor_results_overall <- coffee_long %>%
  group_by(variable) %>%
  summarise(cor_test = list(cor.test(review_count_log, value, method = "pearson"))) %>%
  mutate(
    R = map_dbl(cor_test, ~ .x$estimate),
    p_value = map_dbl(cor_test, ~ .x$p.value)
  )

# Add formatted R and p-values as labels
cor_results_overall <- cor_results_overall %>%
  mutate(
    label = paste0("R = ", signif(R, 2), ", p = ", signif(p_value, 2))
  )

Now, plotting.

ggplot(coffee_long, aes(x = review_count_log, y = value, color = county)) +
  geom_point(alpha = 1) +
  geom_smooth(method = "lm", se = FALSE) +  # Add trend lines
  facet_wrap(~ variable, scales = "free", labeller = labeller(variable = c(
    hhincome = "Median Annual Household Income ($)",
    pct_pov_log = "Percent Residents Under Poverty (log)",
    pct_white = "Percent White Residents",
    pop = "Total Population"
  ))) +
  geom_text(data = cor_results_overall, aes(x = 1, y = Inf, label = label),
            vjust = 1.5, hjust = 0.2, inherit.aes = FALSE, size = 3) +
  labs(title = "Scatterplot between logged review count & neighborhood characteristics",
       subtitle = "Using Yelp data in Five Counties Around Atlanta, GA",
       x = "Review Count Logged",
       y = "Values",
       color = "County") +
  theme_minimal() +
  theme(strip.background = element_rect(fill = "gray90", color = "black", size = 0.3),
        strip.text = element_text(size = 8),
        panel.border = element_rect(color = "black", fill = NA, size = 0.3))

Findings: The top-left panel shows a weak positive correlation between household income and review count (R = 0.13, p = 0.011), while the percentage of residents under poverty has a weak negative correlation with review count (R = -0.19, p = 0.00032). The percentage of white residents shows the strongest positive correlation with review count (R = 0.28, p = 7.5e-08), though it is still modest, while total population has no significant relationship (R = -0.017, p = 0.75). These results suggest that race and income are associated with the number of reviews a business receives, whereas population size is not.
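One way to gauge how weak these relationships are is to square each coefficient, which gives the share of variance linearly shared between the two variables. The sketch below simply reuses the R values reported in the findings above; it is an illustrative reading aid, not part of the original analysis.

```r
# Coefficients as reported in the Plot 4 findings.
r <- c(hhincome = 0.13, pct_pov_log = -0.19, pct_white = 0.28, pop = -0.017)

# Squared correlation: share of variance in one variable that is
# linearly associated with the other.
r_squared <- r^2
print(round(r_squared, 3))
```

Even the largest coefficient (`pct_white`) corresponds to less than 8% shared variance, so these neighborhood characteristics explain only a small part of the variation in logged review count.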

Mini4

Zihan Weng

2024-10-10
