Load data / load libraries

coffee <- read.csv("coffee.csv")
library(ggpubr)
## Loading required package: ggplot2
library(ggplot2)
library(patchwork)
library(tidyr)
library(skimr)
skim(coffee)
Data summary
Name coffee
Number of rows 363
Number of columns 14
_______________________
Column type frequency:
character 1
numeric 13
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
county 0 1 11 15 0 5 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
X 0 1.00 1.820000e+02 104.93 1.000000e+00 9.150000e+01 1.8200e+02 2.725000e+02 3.630000e+02 ▇▇▇▇▇
GEOID 0 1.00 1.310284e+10 26584603.18 1.306304e+10 1.306703e+10 1.3121e+10 1.312101e+10 1.313505e+10 ▃▃▁▁▇
hhincome 0 1.00 7.831331e+04 36236.17 1.248500e+04 5.034400e+04 7.1482e+04 9.604200e+04 2.361490e+05 ▆▇▃▁▁
pct_pov 0 1.00 1.300000e-01 0.10 1.000000e-02 6.000000e-02 1.1000e-01 1.900000e-01 7.700000e-01 ▇▃▁▁▁
review_count 0 1.00 6.576000e+01 98.08 1.000000e+00 2.225000e+01 4.0250e+01 7.150000e+01 1.326000e+03 ▇▁▁▁▁
avg_rating 0 1.00 3.090000e+00 1.04 1.000000e+00 2.000000e+00 3.0000e+00 4.000000e+00 5.000000e+00 ▁▇▅▇▁
race.tot 0 1.00 6.364500e+03 3322.49 1.254000e+03 4.170500e+03 5.8790e+03 7.914000e+03 2.639900e+04 ▇▅▁▁▁
avg_price 39 0.89 1.350000e+00 0.48 1.000000e+00 1.000000e+00 1.0000e+00 2.000000e+00 2.000000e+00 ▇▁▁▁▅
pct_white 0 1.00 4.900000e-01 0.26 0.000000e+00 3.200000e-01 5.1000e-01 7.100000e-01 9.600000e-01 ▅▅▇▇▅
hhincome_log 0 1.00 1.117000e+01 0.46 9.430000e+00 1.083000e+01 1.1180e+01 1.147000e+01 1.237000e+01 ▁▂▇▇▂
review_count_log 0 1.00 3.430000e+00 1.01 6.900000e-01 2.920000e+00 3.4800e+00 3.990000e+00 7.190000e+00 ▁▅▇▂▁
pct_pov_log 0 1.00 -2.190000e+00 0.72 -3.990000e+00 -2.720000e+00 -2.1500e+00 -1.620000e+00 -2.500000e-01 ▂▆▇▆▁
yelp_n 0 1.00 2.540000e+00 2.20 1.000000e+00 1.000000e+00 2.0000e+00 3.000000e+00 1.900000e+01 ▇▁▁▁▁
head(coffee)

Plot 1. Variables used - avg_rating, hhincome

bxplot <- ggplot(data = coffee) +
  # group by rating so each rating level gets its own box
  geom_boxplot(aes(x = avg_rating, y = hhincome, group = avg_rating),
               color = "black", fill = "white")

plotly::ggplotly(bxplot)

This first plot suggests some association between median household income and the average rating of local coffee shops. Higher household income appears to be associated with higher-rated coffee shops up to 4 stars, but 5-star coffee shops are associated with lower household income. Whether this reflects the mindset of higher-income households or a general trend in reviews is unclear. It could indicate an unwillingness to give full five-star reviews among people who have experienced a wide range of high-quality services and products, or it could mean that even coffee shops in affluent areas cannot reach a certain level of quality because they do not offer the full services of a restaurant.
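The pattern described above (income rising with rating up to 4 stars, then falling at 5) can also be checked numerically rather than read off the boxes. The following is a minimal sketch in base R, not part of the original analysis. `coffee_demo` is a hypothetical stand-in with made-up values; in practice the same calls would be run on `coffee`.

```r
# Hypothetical stand-in for `coffee`; these values are invented for illustration only.
coffee_demo <- data.frame(
  avg_rating = c(1, 2, 2, 3, 3, 4, 4, 5, 5),
  hhincome   = c(30000, 42000, 48000, 60000, 66000, 90000, 98000, 70000, 74000)
)

# Median household income and number of rows at each rating level.
median_income <- tapply(coffee_demo$hhincome, coffee_demo$avg_rating, median)
n_rows        <- tapply(coffee_demo$hhincome, coffee_demo$avg_rating, length)

income_by_rating <- data.frame(
  avg_rating      = as.numeric(names(median_income)),
  median_hhincome = as.vector(median_income),
  n               = as.vector(n_rows)
)
print(income_by_rating)
```

Reporting `n` next to each median helps show whether a dip at 5 stars rests on many tracts or only a handful.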

Plot 2. Variables used - avg_rating, hhincome, county

bxplot2 <- ggplot(data = coffee) +
  # group by rating so each rating level gets its own box within a county
  geom_boxplot(aes(x = avg_rating, y = hhincome, group = avg_rating),
               color = "black", fill = "white") +
  facet_wrap(~county)
plotly::ggplotly(bxplot2)

This second plot starts to show that income levels are closely associated with county: Cobb County and Fulton County have higher median household incomes across all average ratings. It also shows more outliers at the upper end of income. In Cobb, Fulton, and Gwinnett Counties, average rating increases with tract median income, up until 5-star average ratings.

Plot 3. Variables used - review_count_log, hhincome, county, pct_white

ggplot(data = coffee) +
  geom_point(mapping = aes(x=review_count_log, y=hhincome, 
                           color=pct_white), alpha=0.4) +
  scale_color_gradient(low="darkblue", high="red") +
  facet_wrap(~county)+
  ggtitle("Scatterplot: Review Count vs. Household Income")

This plot shows that the higher the median household income, the higher the percentage of white residents in a tract. It also suggests that the percentage of white residents is not strongly associated with review count. The association between income and reviews appears weaker here than in the box plots, although this plot uses logged review count on the x-axis rather than average rating.
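The two visual impressions from Plot 3 (income tracks percent white closely; income tracks review count only weakly) can be put on a common scale with Pearson correlations. This is a sketch under stated assumptions: `pairwise_r` is a hypothetical helper, and `coffee_demo` holds made-up values standing in for `coffee`.

```r
# Hypothetical stand-in for `coffee`; these values are invented for illustration only.
coffee_demo <- data.frame(
  hhincome         = c(30000, 45000, 60000, 80000, 100000, 120000),
  pct_white        = c(0.10, 0.25, 0.30, 0.55, 0.60, 0.80),
  review_count_log = c(3.1, 2.4, 3.9, 2.8, 3.6, 3.0)
)

# Pearson correlation between two named columns, skipping incomplete rows.
pairwise_r <- function(df, x, y) {
  cor(df[[x]], df[[y]], method = "pearson", use = "complete.obs")
}

r_income_white   <- pairwise_r(coffee_demo, "hhincome", "pct_white")
r_income_reviews <- pairwise_r(coffee_demo, "hhincome", "review_count_log")

print(round(c(income_vs_white = r_income_white,
              income_vs_reviews = r_income_reviews), 2))
```

On the real data, `cor.test()` could replace `cor()` if a confidence interval is wanted alongside each coefficient.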

Plot 4. Variables used - pct_pov_log, hhincome, pct_white, race.tot, review_count_log, county

pivot longer

coffee_long <- tidyr::pivot_longer(coffee, c(hhincome, pct_pov_log, pct_white, race.tot),
                                   names_to = "variable", values_to = "Value")
head(coffee_long)
# pct_pov_log is the logged poverty share, so the panel title says so
facet_labels <- c("hhincome" = "Median Annual Household Income",
                  "pct_pov_log" = "Percent Residents Under Poverty (Logged)",
                  "pct_white" = "Percent White Residents",
                  "race.tot" = "Total Population")

scpl <- ggplot(coffee_long, aes(x = review_count_log, y = Value, color = county)) +
  geom_point() +
  # one fitted line per county; x, y, and color are inherited from ggplot()
  geom_smooth(method = "lm", se = FALSE) +
  ggtitle("Scatterplot: Review Count vs. Neighborhood Characteristics", subtitle = "Using Yelp Data from 5 Counties Around Atlanta, GA") +
  labs(x = "Review Counts Logged",  # Add x-axis label
       y = "Value") + # Add y-axis label
  # pooled correlation per panel; inherit.aes = FALSE keeps the county color
  # mapping out of the statistic, so there is no colour aesthetic to drop
  stat_cor(aes(x = review_count_log, y = Value), inherit.aes = FALSE,
           label.x.npc = .25,
           label.y.npc = 1.0,
           vjust = 1) +
  facet_wrap(~variable, scales = "free_y", labeller = as_labeller(facet_labels)) +
  theme(plot.title = element_text(lineheight = 1.5),  # Adjust line height
        plot.subtitle = element_text(lineheight = 1, size = 10))  # Adjust subtitle font size
print(scpl)
## `geom_smooth()` using formula = 'y ~ x'

These last plots show how much the associations differ when broken down by county, although none of the pooled associations is strong. The largest absolute correlation coefficient (R) is 0.28, for Percent White Residents, while the others are below 0.2 in absolute value. Among these variables, the percentage of white residents is therefore the strongest correlate of coffee shop review counts.

Looking at individual counties within each neighborhood characteristic allows more conclusions to be drawn. For example, the panel for percent of white residents indicates that in DeKalb County the percentage tends to increase with logged review count, while the other counties have flatter regression lines. The other variables mostly show flat regression lines as well, again except in DeKalb County, where there is a slight positive relationship between median income and review count and a slight negative relationship between the share of residents living under poverty and review count. This suggests that these socioeconomic characteristics are related to this one neighborhood amenity in DeKalb County, but much less so in the other counties.
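The correlation labels printed on the plot are pooled across counties, while the county-level reading relies on the slopes of the fitted lines. A per-county correlation table makes that comparison explicit. The following is a minimal base R sketch: `r_by_county` is a hypothetical helper, and `coffee_demo`, including its county names, is made-up data standing in for `coffee`.

```r
# Hypothetical stand-in for `coffee`; counties and values are invented for illustration only.
coffee_demo <- data.frame(
  county           = rep(c("County A", "County B"), each = 4),
  review_count_log = c(1, 2, 3, 4, 1, 2, 3, 4),
  pct_white        = c(0.2, 0.4, 0.6, 0.8, 0.5, 0.4, 0.6, 0.5),
  hhincome         = c(80000, 60000, 40000, 20000, 50000, 52000, 51000, 53000)
)

# Pearson r between `x` and each column in `vars`, computed separately per county.
# With two or more entries in `vars`, the result is a matrix:
# one row per variable, one column per county.
r_by_county <- function(df, vars, x = "review_count_log") {
  sapply(split(df, df$county), function(d) {
    sapply(vars, function(v) cor(d[[x]], d[[v]], use = "complete.obs"))
  })
}

r_tab <- r_by_county(coffee_demo, c("pct_white", "hhincome"))
print(round(r_tab, 2))
```

On the real data, `vars` would be `c("hhincome", "pct_pov_log", "pct_white", "race.tot")`, matching the four panels of Plot 4.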