Mini 4

Load data and libraries

coffee <- read.csv("coffee.csv")
library(ggpubr)

## Loading required package: ggplot2

library(ggplot2)
library(patchwork)
library(tidyr)
library(skimr)

skim(coffee)

Data summary
Name	coffee
Number of rows	363
Number of columns	14
_______________________
Column type frequency:
character	1
numeric	13
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
county	0	1	11	15	0	5	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
X	0	1.00	1.820000e+02	104.93	1.000000e+00	9.150000e+01	1.8200e+02	2.725000e+02	3.630000e+02	▇▇▇▇▇
GEOID	0	1.00	1.310284e+10	26584603.18	1.306304e+10	1.306703e+10	1.3121e+10	1.312101e+10	1.313505e+10	▃▃▁▁▇
hhincome	0	1.00	7.831331e+04	36236.17	1.248500e+04	5.034400e+04	7.1482e+04	9.604200e+04	2.361490e+05	▆▇▃▁▁
pct_pov	0	1.00	1.300000e-01	0.10	1.000000e-02	6.000000e-02	1.1000e-01	1.900000e-01	7.700000e-01	▇▃▁▁▁
review_count	0	1.00	6.576000e+01	98.08	1.000000e+00	2.225000e+01	4.0250e+01	7.150000e+01	1.326000e+03	▇▁▁▁▁
avg_rating	0	1.00	3.090000e+00	1.04	1.000000e+00	2.000000e+00	3.0000e+00	4.000000e+00	5.000000e+00	▁▇▅▇▁
race.tot	0	1.00	6.364500e+03	3322.49	1.254000e+03	4.170500e+03	5.8790e+03	7.914000e+03	2.639900e+04	▇▅▁▁▁
avg_price	39	0.89	1.350000e+00	0.48	1.000000e+00	1.000000e+00	1.0000e+00	2.000000e+00	2.000000e+00	▇▁▁▁▅
pct_white	0	1.00	4.900000e-01	0.26	0.000000e+00	3.200000e-01	5.1000e-01	7.100000e-01	9.600000e-01	▅▅▇▇▅
hhincome_log	0	1.00	1.117000e+01	0.46	9.430000e+00	1.083000e+01	1.1180e+01	1.147000e+01	1.237000e+01	▁▂▇▇▂
review_count_log	0	1.00	3.430000e+00	1.01	6.900000e-01	2.920000e+00	3.4800e+00	3.990000e+00	7.190000e+00	▁▅▇▂▁
pct_pov_log	0	1.00	-2.190000e+00	0.72	-3.990000e+00	-2.720000e+00	-2.1500e+00	-1.620000e+00	-2.500000e-01	▂▆▇▆▁
yelp_n	0	1.00	2.540000e+00	2.20	1.000000e+00	1.000000e+00	2.0000e+00	3.000000e+00	1.900000e+01	▇▁▁▁▁

head(coffee)

Plot 1. Variables used - avg_rating, hhincome

bxplot <- ggplot(data = coffee) +
  # avg_rating is numeric, so group explicitly to get one box per rating
  geom_boxplot(aes(x = avg_rating, y = hhincome, group = avg_rating),
               color = "black", fill = "white")

plotly::ggplotly(bxplot)

This first plot shows some association between median household income and the average rating of local coffee shops. Higher household income is associated with higher-rated coffee shops up to 4 stars, but 5-star coffee shops are associated with lower household income. Whether this reflects the mentality of higher-income households or a general trend in reviews is unclear. It could indicate an unwillingness to give full five-star reviews among people who have experienced all levels of high-quality services or products, or that even coffee shops in nice areas cannot reach a certain level of quality because they do not offer the full services of a restaurant.
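One way to check the pattern read off the box plot is to tabulate the median income at each rating level. The sketch below is only an illustration: `toy` is a small made-up data frame (its values are hypothetical, not taken from coffee.csv) that reuses the column names `avg_rating` and `hhincome`. With the real data, `coffee` would take the place of `toy`.

```r
# Hypothetical toy data with the same column names as the coffee data
toy <- data.frame(
  avg_rating = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
  hhincome   = c(40000, 50000, 55000, 65000, 70000, 80000,
                 90000, 100000, 60000, 70000)
)

# Median household income within each rating level (base R, no extra packages)
med_by_rating <- aggregate(hhincome ~ avg_rating, data = toy, FUN = median)
print(med_by_rating)
```

A table like this makes it easier to see whether the median really rises through 4 stars and falls at 5, rather than judging it by eye from the boxes.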

Plot 2. Variables used - avg_rating, hhincome, county

bxplot2 <- ggplot(data = coffee) +
  # avg_rating is numeric, so group explicitly to get one box per rating
  geom_boxplot(aes(x = avg_rating, y = hhincome, group = avg_rating),
               color = "black", fill = "white") +
  facet_wrap(~county)
plotly::ggplotly(bxplot2)

This second plot shows that income levels are closely associated with county: Cobb County and Fulton County have higher median household incomes across all average ratings. It also shows that most outliers fall on the upper end of income. Cobb, Fulton, and Gwinnett Counties follow the pattern of average rating increasing in tracts with higher median income, up until 5-star average ratings.

Plot 3. Variables used - review_count_log, hhincome, county, pct_white

ggplot(data = coffee) +
  geom_point(mapping = aes(x=review_count_log, y=hhincome, 
                           color=pct_white), alpha=0.4) +
  scale_color_gradient(low="darkblue", high="red") +
  facet_wrap(~county)+
  ggtitle("Scatterplot: Review Count vs. Household Income")

This plot shows that the higher the median household income of a tract, the higher its percentage of white residents. It also suggests that the percentage of white residents is not strongly associated with review count. Because this plot uses logged review count rather than average rating, it is not directly comparable to the box plots, but the association between income and reviews appears weaker here.

Plot 4. Variables used - pct_pov_log, hhincome, pct_white, race.tot, review_count_log, county

Pivot to long format

coffee_long <- tidyr::pivot_longer(
  coffee,
  c(hhincome, pct_pov_log, pct_white, race.tot),
  names_to = "variable",
  values_to = "Value"
)
head(coffee_long)
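To make the reshaping step concrete, here is a minimal sketch on a made-up two-row data frame. The values and county names in `toy` are hypothetical and are not taken from coffee.csv. Only the column names mirror the real data.

```r
library(tidyr)

# Hypothetical toy data (not from coffee.csv)
toy <- data.frame(
  county = c("County A", "County B"),
  review_count_log = c(3.1, 4.2),
  hhincome = c(50000, 80000),
  pct_white = c(0.30, 0.60)
)

# Stack the two measure columns into name/value pairs
toy_long <- pivot_longer(toy, c(hhincome, pct_white),
                         names_to = "variable", values_to = "Value")

# Each input row becomes one row per pivoted column: 2 rows x 2 columns = 4 rows
print(toy_long)
```

The columns that are not pivoted (`county`, `review_count_log`) are repeated on every output row, which is what lets a single `facet_wrap(~variable)` call draw one panel per characteristic.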

facet_labels <- c(
  "hhincome" = "Median Annual Household Income",
  "pct_pov_log" = "Percent Residents Under Poverty (log)",  # this column is log-transformed
  "pct_white" = "Percent White Residents",
  "race.tot" = "Total Population"
)

scpl <- ggplot(coffee_long, aes(x = review_count_log, y = Value, color = county)) +
  geom_point() +
  # one fitted line per county; x, y, and color are inherited from ggplot()
  geom_smooth(method = "lm", se = FALSE) +
  ggtitle("Scatterplot: Review Count vs. Neighborhood Characteristics", subtitle = "Using Yelp Data from 5 Counties Around Atlanta, GA") +
  labs(x = "Review Counts Logged",  # Add x-axis label
       y = "Value") + # Add y-axis label
  # one pooled correlation per panel; inherit.aes = FALSE keeps the county
  # color mapping out of this layer, which avoids the dropped-aesthetic warning
  stat_cor(aes(x = review_count_log, y = Value), inherit.aes = FALSE,
           label.x.npc = .25,
           label.y.npc = 1.0,
           vjust = 1) +
  facet_wrap(~variable, scales = "free_y", labeller = as_labeller(facet_labels)) +
  theme(plot.title = element_text(lineheight = 1.5),  # Adjust line height
        plot.subtitle = element_text(lineheight = 1, size = 10))  # Adjust subtitle font size

print(scpl)

## `geom_smooth()` using formula = 'y ~ x'

These last plots show how much the associations differ when broken down by county, although none of the pooled associations is strong. The largest correlation coefficient in absolute value is R = 0.28, for percent white residents; the others are below 0.2 in absolute value. Of the four characteristics, the percentage of white residents is therefore the one most strongly associated with the number of coffee shop reviews. Looking at individual counties within each panel supports further conclusions. In the percent white panel, for example, DeKalb County's regression line slopes upward, so tracts there with more reviews tend to have a higher percentage of white residents, while the other counties have flatter lines. The other variables mostly show flat regression lines, again except in DeKalb County, where there is a slight positive relationship between median income and review count and a slight negative relationship between the share of residents under poverty and review count. This suggests that these socioeconomic characteristics are related to this one neighborhood amenity in DeKalb County, but much less so in the other counties.
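The correlation labels on the plot are pooled across counties within each panel, so the per-county statements rest on the slopes of the fitted lines. A per-county correlation could back them up more directly. The following is a minimal sketch of that idea, not part of the original analysis: `toy` is a made-up data frame with hypothetical values and county names, reusing the column names `county`, `review_count_log`, and `pct_white`.

```r
# Hypothetical toy data; column names mirror the coffee data
toy <- data.frame(
  county = rep(c("County A", "County B"), each = 4),
  review_count_log = c(1, 2, 3, 4, 1, 2, 3, 4),
  pct_white = c(0.2, 0.4, 0.6, 0.8, 0.5, 0.5, 0.4, 0.6)
)

# Pearson correlation between review count and percent white within each county
r_by_county <- sapply(split(toy, toy$county), function(d) {
  cor(d$review_count_log, d$pct_white)
})
print(round(r_by_county, 2))
```

Run on the real data, the same `split()` and `sapply()` pattern would return one coefficient per county, which could then be compared with the pooled value shown on each panel.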