Load data / load libraries
coffee <- read.csv("coffee.csv")
library(ggpubr)
## Loading required package: ggplot2
library(ggplot2)
library(patchwork)
library(tidyr)
library(skimr)
skim(coffee)
| Name | coffee |
| Number of rows | 363 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 13 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| county | 0 | 1 | 11 | 15 | 0 | 5 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| X | 0 | 1.00 | 1.820000e+02 | 104.93 | 1.000000e+00 | 9.150000e+01 | 1.8200e+02 | 2.725000e+02 | 3.630000e+02 | ▇▇▇▇▇ |
| GEOID | 0 | 1.00 | 1.310284e+10 | 26584603.18 | 1.306304e+10 | 1.306703e+10 | 1.3121e+10 | 1.312101e+10 | 1.313505e+10 | ▃▃▁▁▇ |
| hhincome | 0 | 1.00 | 7.831331e+04 | 36236.17 | 1.248500e+04 | 5.034400e+04 | 7.1482e+04 | 9.604200e+04 | 2.361490e+05 | ▆▇▃▁▁ |
| pct_pov | 0 | 1.00 | 1.300000e-01 | 0.10 | 1.000000e-02 | 6.000000e-02 | 1.1000e-01 | 1.900000e-01 | 7.700000e-01 | ▇▃▁▁▁ |
| review_count | 0 | 1.00 | 6.576000e+01 | 98.08 | 1.000000e+00 | 2.225000e+01 | 4.0250e+01 | 7.150000e+01 | 1.326000e+03 | ▇▁▁▁▁ |
| avg_rating | 0 | 1.00 | 3.090000e+00 | 1.04 | 1.000000e+00 | 2.000000e+00 | 3.0000e+00 | 4.000000e+00 | 5.000000e+00 | ▁▇▅▇▁ |
| race.tot | 0 | 1.00 | 6.364500e+03 | 3322.49 | 1.254000e+03 | 4.170500e+03 | 5.8790e+03 | 7.914000e+03 | 2.639900e+04 | ▇▅▁▁▁ |
| avg_price | 39 | 0.89 | 1.350000e+00 | 0.48 | 1.000000e+00 | 1.000000e+00 | 1.0000e+00 | 2.000000e+00 | 2.000000e+00 | ▇▁▁▁▅ |
| pct_white | 0 | 1.00 | 4.900000e-01 | 0.26 | 0.000000e+00 | 3.200000e-01 | 5.1000e-01 | 7.100000e-01 | 9.600000e-01 | ▅▅▇▇▅ |
| hhincome_log | 0 | 1.00 | 1.117000e+01 | 0.46 | 9.430000e+00 | 1.083000e+01 | 1.1180e+01 | 1.147000e+01 | 1.237000e+01 | ▁▂▇▇▂ |
| review_count_log | 0 | 1.00 | 3.430000e+00 | 1.01 | 6.900000e-01 | 2.920000e+00 | 3.4800e+00 | 3.990000e+00 | 7.190000e+00 | ▁▅▇▂▁ |
| pct_pov_log | 0 | 1.00 | -2.190000e+00 | 0.72 | -3.990000e+00 | -2.720000e+00 | -2.1500e+00 | -1.620000e+00 | -2.500000e-01 | ▂▆▇▆▁ |
| yelp_n | 0 | 1.00 | 2.540000e+00 | 2.20 | 1.000000e+00 | 1.000000e+00 | 2.0000e+00 | 3.000000e+00 | 1.900000e+01 | ▇▁▁▁▁ |
head(coffee)
Plot 1. Variables used - avg_rating, hhincome
bxplot <- ggplot(data = coffee) +
  geom_boxplot(aes(x = avg_rating, y = hhincome),
               color = "black", fill = "white")
plotly::ggplotly(bxplot)
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
This first plot suggests some association between median household income and the average rating of local coffee shops. Higher household income appears to go with higher-rated coffee shops up to 4 stars, while 5-star coffee shops are associated with lower household income. Whether this reflects the mentality of higher-income households or a general trend in reviews is unclear. It could indicate an unwillingness to give a full five stars among people who have experienced every level of high-quality service or product, or that even coffee shops in nice areas cannot reach a certain level of quality without offering the full services of a restaurant.
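The `Continuous x aesthetic` warning printed with this plot suggests that ggplot2 treated `avg_rating` as a continuous variable. A minimal sketch, assuming the ratings take only a handful of distinct values, converts the rating to a factor so that each rating gets its own box, then prints the median income per rating so the "rises to 4 stars, dips at 5" reading can be checked as numbers. The `toy` data frame and every value in it are invented for illustration and only stand in for `coffee`.

```r
library(ggplot2)

# Hypothetical stand-in for `coffee`: two tracts per rating, invented incomes.
toy <- data.frame(
  avg_rating = rep(1:5, each = 2),
  hhincome   = c(40000, 44000, 52000, 56000, 68000, 72000,
                 88000, 92000, 60000, 64000)
)

# factor() makes the x aesthetic discrete, so ggplot2 draws one box per rating.
bx_grouped <- ggplot(toy, aes(x = factor(avg_rating), y = hhincome)) +
  geom_boxplot(color = "black", fill = "white") +
  labs(x = "Average rating", y = "Median household income")

# Median income per rating, to read the pattern off as numbers.
med_by_rating <- tapply(toy$hhincome, toy$avg_rating, median)
print(med_by_rating)
```

On the real data, the same two steps would use `coffee` in place of `toy`. Whether a factor is the right choice depends on how many distinct rating values the column actually holds.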
Plot 2. Variables used - avg_rating, hhincome, county
bxplot2 <- ggplot(data = coffee) +
  geom_boxplot(aes(x = avg_rating, y = hhincome),
               color = "black", fill = "white") +
  facet_wrap(~county)
plotly::ggplotly(bxplot2)
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
This second plot starts to show that income levels are closely associated with county: Cobb County and Fulton County have higher median household incomes across all average ratings. It also shows more outliers on the upper end of income. In Cobb, Fulton, and Gwinnett Counties, average rating increases with tract median income, up until 5-star average ratings.
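The county comparison can also be read from a small table of medians, one per county and rating, which mirrors one box per facet panel. The sketch below uses base R only. The `toy` data frame is a hypothetical stand-in for `coffee`: the county names come from the text, but all incomes and ratings are invented.

```r
# Hypothetical stand-in for `coffee`; values are invented for illustration.
toy <- data.frame(
  county     = rep(c("Cobb County", "DeKalb County"), each = 4),
  avg_rating = rep(c(3, 3, 4, 4), times = 2),
  hhincome   = c(80000, 84000, 95000, 99000,
                 45000, 47000, 50000, 54000)
)

# One median per county-by-rating cell.
med_tab <- aggregate(hhincome ~ county + avg_rating, data = toy, FUN = median)

# Wide layout: rows are counties, columns are ratings.
# Each cell holds a single median, so the sum computed by xtabs() is that median.
med_wide <- xtabs(hhincome ~ county + avg_rating, data = med_tab)
print(med_wide)
```

Comparing rows of `med_wide` shows whether one county sits above another at every rating, which is the claim the faceted box plot is making visually.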
Plot 3. Variables used - review_count_log, hhincome, county, pct_white
ggplot(data = coffee) +
  geom_point(mapping = aes(x = review_count_log, y = hhincome,
                           color = pct_white), alpha = 0.4) +
  scale_color_gradient(low = "darkblue", high = "red") +
  facet_wrap(~county) +
  ggtitle("Scatterplot: Review Count vs. Household Income")
This plot shows that the higher the median household income, the higher
the percentage of white residents in a tract. It also seems to indicate
that the percentage of white residents is not strongly associated with
the logged review count. The association between income and review count
also looks weaker here than the association between income and average
rating in the box plots.
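The visual reading of the scatterplot can be cross-checked with within-county correlations. The sketch below is a rough illustration in base R, not the analysis behind the plot. `toy` is a hypothetical stand-in for `coffee` with invented values, and the row labels `income_vs_white` and `income_vs_reviews` are names chosen here.

```r
# Hypothetical stand-in for `coffee`; all values are invented.
toy <- data.frame(
  county           = rep(c("Cobb County", "DeKalb County"), each = 4),
  hhincome         = c(50000, 70000, 90000, 110000, 30000, 50000, 70000, 90000),
  pct_white        = c(0.30, 0.45, 0.60, 0.75, 0.10, 0.20, 0.50, 0.60),
  review_count_log = c(3.0, 2.5, 3.5, 3.0, 2.0, 3.5, 2.5, 3.0)
)

# Pearson correlations within each county for the two pairs read off the plot.
cors <- sapply(split(toy, toy$county), function(d) {
  c(income_vs_white   = cor(d$hhincome, d$pct_white),
    income_vs_reviews = cor(d$hhincome, d$review_count_log))
})
print(round(cors, 2))
```

A large `income_vs_white` next to a small `income_vs_reviews` would match the pattern described for the color gradient and the point cloud.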
Plot 4. Variables used - pct_pov_log, hhincome, pct_white, race.tot, review_count_log, county
Pivot to long format
coffee_long <- tidyr::pivot_longer(coffee, c(hhincome,pct_pov_log, pct_white, race.tot), names_to = "variable", values_to = "Value")
head(coffee_long)
facet_labels <- c("hhincome" = "Median Annual Household Income", "pct_pov_log" = "Percent Residents Under Poverty", "pct_white" = "Percent White Residents", "race.tot" = "Total Population")
scpl <- ggplot(coffee_long, aes(x = review_count_log, y = Value, color = county)) +
  geom_point() +
  geom_smooth(mapping = aes(x = review_count_log, y = Value, color = county),
              method = "lm", se = FALSE) +
  ggtitle("Scatterplot: Review Count vs. Neighborhood Characteristics",
          subtitle = "Using Yelp Data from 5 Counties Around Atlanta, GA") +
  labs(x = "Review Counts Logged",  # Add x-axis label
       y = "Value") +               # Add y-axis label
  stat_cor(aes(x = review_count_log, y = Value, group = variable),
           label.x.npc = .25,
           label.y.npc = 1.0,
           vjust = 1) +
  facet_wrap(~variable, scales = "free_y", labeller = as_labeller(facet_labels)) +
  theme(plot.title = element_text(lineheight = 1.5),              # Adjust line height
        plot.subtitle = element_text(lineheight = 1, size = 10))  # Adjust subtitle font size
print(scpl)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: The following aesthetics were dropped during statistical transformation: colour
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
These last plots show how different the associations are when broken down by county, but across variables, none of the associations is strong. The largest correlation coefficient (R) in absolute value is 0.28, for Percent White Residents, while the others are below 0.2 in absolute value. Of the four characteristics, the percentage of white residents is therefore the strongest indicator of an increase in coffee shop reviews, although the association is still weak. Looking at individual counties within each neighborhood characteristic allows more conclusions to be drawn. For example, in the Percent White Residents panel, the percentage of white residents in DeKalb County tracts tends to increase with the logged review count, while the other counties have flatter regression lines. The other variables mostly show flat regression lines, again except in DeKalb County, where there is a slight positive relationship between median income and review count and a slight negative relationship between residents living under poverty and review count. This suggests that these socioeconomic characteristics are related to this one neighborhood amenity in DeKalb County, but less so in the other counties.
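The two kinds of numbers discussed here can be reproduced outside the plot. The R printed in each panel is a Pearson correlation (the default for `stat_cor()`), pooled across counties because the call groups by `variable`. Each colored line is a separate `lm` fit per county. The sketch below is a minimal base R illustration under those assumptions. `toy_long` is a hypothetical stand-in for `coffee_long`, with invented values and only two of the four variables.

```r
# Hypothetical long-format stand-in for `coffee_long`; all values are invented.
toy_long <- data.frame(
  county           = rep(rep(c("Cobb County", "DeKalb County"), each = 4), times = 2),
  variable         = rep(c("hhincome", "pct_white"), each = 8),
  review_count_log = rep(c(1, 2, 3, 4), times = 4),
  Value            = c(60, 62, 61, 63, 40, 50, 60, 70,                  # hhincome, in thousands
                       0.50, 0.52, 0.51, 0.53, 0.20, 0.30, 0.40, 0.50)  # pct_white
)

# One pooled Pearson r per facet (variable), ignoring county.
r_by_variable <- sapply(split(toy_long, toy_long$variable), function(d) {
  cor(d$review_count_log, d$Value)
})

# One lm slope per county within each facet, matching the colored lines.
slopes <- sapply(split(toy_long, list(toy_long$county, toy_long$variable)), function(d) {
  unname(coef(lm(Value ~ review_count_log, data = d))[2])
})

print(round(r_by_variable, 2))
print(round(slopes, 3))
```

Because the panels use free y scales, slopes are only comparable within a panel. Comparing a county's slope with the pooled r for the same variable shows how a single county can drive, or hide behind, the overall correlation.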