UA Mini 4 Visualization for understanding the distribution of POIs across neighborhoods

R Setup

1 Coffee data

The following is an integrated dataset between ACS and Yelp, showing the sociodemographic characteristics of census tracts and the distribution of coffee shops in each census tract.

library(sf); library(tidyverse)  # st_read() comes from sf; the pipelines below use the tidyverse
coffee <- st_read("data/coffee.csv")

:)   Reading layer `coffee' from data source `/Users/jaeglee/Library/CloudStorage/OneDrive-GeorgiaInstituteofTechnology/2024 Fall/UA/UA_R/assignments/data/coffee.csv' using driver `CSV'

coffee

:)     field_1       GEOID         county hhincome     pct_pov review_count avg_rating  pop avg_price   pct_white hhincome_log review_count_log  pct_pov_log yelp_n
:)   1       1 13063040202 Clayton County    33276 0.201342282           57          2 2850         1 0.075087719  10.41289217      4.060443011 -1.554276271      1
:)   2       2 13063040308 Clayton County    28422 0.210718002           13          3 4262         1 0.260675739  10.25527055      1.975621859 -1.510869401      2
:)   3       3 13063040407 Clayton County    49271 0.108255067  29.33333333          2 4046         1  0.20514088  10.80529389      3.320836964 -2.134911405      3
:)   4       4 13063040408 Clayton County    44551 0.180956612           20          4 8489         1 0.168688892  10.70461432      3.044522438  -1.65570904      1
:)   5       5 13063040410 Clayton County    49719 0.114680192           41          1 7166         1 0.193692437  10.81434354      3.737669618 -2.082003287      1
:)    [ reached 'max' / getOption("max.print") -- omitted 358 rows ]

coffee_dtyped <- coffee %>%
  select(-field_1) %>%  # drop the redundant row-index column
  mutate(county = factor(str_split_i(county, " ", 1)),  # "Clayton County" -> "Clayton"
         hhincome = as.numeric(hhincome),
         pct_pov = as.numeric(pct_pov)*100,  # proportion -> percent
         avg_rating = factor(avg_rating, levels=c("1", "2", "3", "4", "5")),
         avg_price = as.numeric(avg_price),

         pct_white = as.numeric(pct_white)*100,  # proportion -> percent
         hhincome_log = as.numeric(hhincome_log),
         review_count_log = as.numeric(review_count_log),
         pct_pov_log = as.numeric(pct_pov_log),

         review_count = as.integer(review_count),  # truncates fractional averages (29.33 -> 29)
         pop = as.integer(pop),
         yelp_n = as.integer(yelp_n))
coffee_dtyped

:)           GEOID  county hhincome pct_pov review_count avg_rating  pop avg_price pct_white hhincome_log review_count_log pct_pov_log yelp_n
:)   1 13063040202 Clayton    33276    20.1           57          2 2850         1      7.51         10.4             4.06       -1.55      1
:)   2 13063040308 Clayton    28422    21.1           13          3 4262         1     26.07         10.3             1.98       -1.51      2
:)   3 13063040407 Clayton    49271    10.8           29          2 4046         1     20.51         10.8             3.32       -2.13      3
:)   4 13063040408 Clayton    44551    18.1           20          4 8489         1     16.87         10.7             3.04       -1.66      1
:)   5 13063040410 Clayton    49719    11.5           41          1 7166         1     19.37         10.8             3.74       -2.08      1
:)    [ reached 'max' / getOption("max.print") -- omitted 358 rows ]
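Casts like the ones above can silently turn unexpected values into NA, so a quick count of missing values after conversion is a cheap sanity check. The sketch below is only an illustration: `raw` is a made-up data frame with invented values, not the coffee data.

```r
library(dplyr)

# Hypothetical raw values, invented to show two ways a cast can produce NA.
raw <- tibble(
  hhincome   = c("33276", "28422", "n/a"),  # "n/a" is not a number
  avg_rating = c("2", "3", "3.5")           # "3.5" is not one of the five levels
)

typed <- raw %>%
  mutate(hhincome   = suppressWarnings(as.numeric(hhincome)),
         avg_rating = factor(avg_rating, levels = c("1", "2", "3", "4", "5")))

# Count the NAs in every column after casting.
na_counts <- typed %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts
```

The same summarise() call applied to coffee_dtyped would show whether any tract lost a value during conversion.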

2 Plot 1

In Plot 1, I examined the association between the average rating of POIs in each census tract and the tract's median household income. In general, census tracts with higher-rated POIs tend to have higher median incomes, implying that higher-quality services are provided for those who can afford higher prices. Interestingly, census tracts with the highest rating (=5) are largely located in lower-income neighborhoods. This might be because attractive POIs are often in downtown areas, which many high-income patrons visit but where resident households often have lower incomes.

coffee_dtyped %>%
  ggplot(aes(x=avg_rating, y=hhincome)) +
  geom_boxplot() +
  labs(x="Average rating", y="Household income") +
  theme_bw()
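The medians drawn in the boxplot can also be read off as numbers, which makes the comparison across rating levels easier to state. This is a minimal sketch on a made-up data frame (`toy`, with invented values) standing in for coffee_dtyped.

```r
library(dplyr)

# Hypothetical stand-in for coffee_dtyped; the values are invented for illustration.
toy <- tibble(
  avg_rating = factor(c("2", "2", "3", "3", "4", "4", "5"),
                      levels = c("1", "2", "3", "4", "5")),
  hhincome   = c(30000, 40000, 50000, 60000, 70000, 90000, 45000)
)

# Tract count and median income per rating level: the numbers behind each box.
rating_summary <- toy %>%
  group_by(avg_rating) %>%
  summarise(n_tracts = n(),
            median_income = median(hhincome),
            .groups = "drop")

rating_summary
```

Reporting n_tracts next to the median helps when a rating level contains only a handful of tracts.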

3 Plot 2

In Plot 2, I expanded on Plot 1 by splitting the association by county. The general trend seen in Plot 1 holds in DeKalb, Fulton, and Gwinnett counties. Clayton and Cobb counties probably have too few census tracts to draw conclusions about the association between average POI ratings and median household incomes.

coffee_dtyped %>%
  ggplot(aes(x=avg_rating, y=hhincome)) +
  geom_boxplot() +
  facet_wrap(~ county) +
  labs(x="Average Yelp rating", y="Median annual household income ($)") +
  theme_bw()
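Whether a county has enough tracts to support a conclusion can be checked by counting tracts per county and per facet cell. The sketch below uses a made-up data frame (`toy`, with invented rows), so the counts are not those of the real data.

```r
library(dplyr)

# Hypothetical stand-in for coffee_dtyped; rows are invented for illustration.
toy <- tibble(
  county     = factor(c("Clayton", "Cobb", "DeKalb", "DeKalb",
                        "Fulton", "Fulton", "Fulton")),
  avg_rating = factor(c("2", "3", "3", "4", "4", "4", "5"),
                      levels = c("1", "2", "3", "4", "5"))
)

# Number of tracts in each county (one facet each).
tracts_per_county <- toy %>% count(county, name = "n_tracts")

# Number of tracts behind each individual box (county x rating).
tracts_per_box <- toy %>% count(county, avg_rating, name = "n_tracts")

tracts_per_county
tracts_per_box
```

A box built from one or two tracts says very little, so these counts are worth reading alongside the faceted plot.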

4 Plot 3

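I referred to this website to find a wide range of palettes in R. In ggplot, there are many different ways to use palettes. The scale functions follow a naming pattern: scale_ + color (or fill) + manual (or viridis, brewer, gradient), for example scale_color_manual() or scale_fill_brewer().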

There are also pre-designed palettes for well-regarded journals (Nature, etc.), and they are easily implemented by using functions like scale_color_npg() in the ggsci package.

library(RColorBrewer)
RColorBrewer::display.brewer.all()

library(viridis)
# Other palette functions can also be used in place of viridis():
# magma(), plasma(), inferno(), cividis()
scales::show_col(viridis(8))
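To make the scale naming pattern concrete, here is a small sketch that applies three of the scale families to the same toy plot. The data frame `toy` and the choice of the "Dark2" and "magma" palettes are my own assumptions for illustration.

```r
library(ggplot2)
library(RColorBrewer)

# Hypothetical toy data, invented for illustration.
toy <- data.frame(
  x     = c(1, 2, 3, 4, 5, 6),
  y     = c(2, 4, 3, 6, 5, 7),
  group = factor(c("a", "a", "b", "b", "c", "c")),
  z     = c(10, 20, 30, 40, 50, 60)
)

# A palette is just a vector of hex colors; name it by factor level for manual use.
manual_cols <- brewer.pal(3, "Dark2")
names(manual_cols) <- levels(toy$group)

# scale_ + color + manual: supply the colors yourself.
p_manual <- ggplot(toy, aes(x, y, color = group)) +
  geom_point() +
  scale_color_manual(values = manual_cols)

# scale_ + color + brewer: refer to a ColorBrewer palette by name.
p_brewer <- ggplot(toy, aes(x, y, color = group)) +
  geom_point() +
  scale_color_brewer(palette = "Dark2")

# scale_ + color + viridis (continuous variant): map a numeric column.
p_viridis <- ggplot(toy, aes(x, y, color = z)) +
  geom_point() +
  scale_color_viridis_c(option = "magma")
```

Printing any of the three objects draws the plot. The fill variants (scale_fill_manual() and so on) work the same way for geoms that use a fill aesthetic.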

In Plot 3, I notice a slightly positive association between the logged average number of reviews of POIs and the median household income in each census tract, especially in Cobb, DeKalb, and Fulton counties. Race and ethnicity also appear to play a role. In DeKalb and Fulton counties, communities where white residents are the majority tend to have higher median incomes and POIs with more reviews. This suggests not only economic but also racial inequity in access to attractive, popular POIs in our study area.

coffee_dtyped %>%
  ggplot(aes(x=review_count_log, y=hhincome, col=pct_white)) +
  geom_point(alpha=0.5, size=2) +
  facet_wrap(~ county) +
  scale_colour_gradient(low = "blue", high = "red") + # for a diverging palette with white in the middle, use scale_color_gradient2()
  labs(x="Review count (log)", y="Median annual household income ($)", color="Percentage of residents\nwho self-identified as white (%)") +
  ggtitle("Scatterplot: Review count vs. Household income") +
  theme_bw()
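Since the interpretation turns on whether white residents are the majority, a diverging palette centered at 50% is one possible alternative to the two-color gradient. The sketch below uses a made-up data frame (`toy`), and the midpoint of 50 is my own assumption about where the neutral color should sit.

```r
library(ggplot2)

# Hypothetical stand-in for coffee_dtyped; the values are invented for illustration.
toy <- data.frame(
  review_count_log = c(1.5, 2.0, 3.0, 3.5, 4.0),
  hhincome         = c(30000, 45000, 60000, 80000, 100000),
  pct_white        = c(5, 25, 50, 75, 95)
)

p_div <- ggplot(toy, aes(x = review_count_log, y = hhincome, color = pct_white)) +
  geom_point(size = 2) +
  # Assumption: 50% is treated as the neutral point of the diverging scale.
  scale_color_gradient2(low = "blue", mid = "white", high = "red",
                        midpoint = 50) +
  theme_bw()
```

White points are hard to see on the white background of theme_bw(), so a light grey for `mid` may read better in practice.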

5 Plot 4

coffee_dtyped

:)           GEOID  county hhincome pct_pov review_count avg_rating  pop avg_price pct_white hhincome_log review_count_log pct_pov_log yelp_n
:)   1 13063040202 Clayton    33276    20.1           57          2 2850         1      7.51         10.4             4.06       -1.55      1
:)   2 13063040308 Clayton    28422    21.1           13          3 4262         1     26.07         10.3             1.98       -1.51      2
:)   3 13063040407 Clayton    49271    10.8           29          2 4046         1     20.51         10.8             3.32       -2.13      3
:)   4 13063040408 Clayton    44551    18.1           20          4 8489         1     16.87         10.7             3.04       -1.66      1
:)   5 13063040410 Clayton    49719    11.5           41          1 7166         1     19.37         10.8             3.74       -2.08      1
:)    [ reached 'max' / getOption("max.print") -- omitted 358 rows ]

I transformed the dataset into long format before the visualization in order to leverage the facet_wrap() function in ggplot. Four columns, namely “hhincome”, “pct_pov_log”, “pct_white”, and “pop”, were pivoted into a name column and a value column.

levels = c("hhincome", "pct_pov_log", "pct_white", "pop")
labels = c("Median annual household income ($)", "Percent residents under poverty (log)", "Percent white resident", "Total population")
coffee_dtyped_longed <- coffee_dtyped %>%
  pivot_longer(cols=c("hhincome", "pct_pov_log", "pct_white", "pop"),
               names_to="exp_variables",
               values_to="values") %>%
  select(GEOID, county, exp_variables, values, review_count_log) %>%
  mutate(exp_variables = factor(exp_variables,
                                levels=levels,
                                labels=labels))
coffee_dtyped_longed

:)   # A tibble: 1,452 × 5
:)      GEOID       county  exp_variables                           values review_count_log
:)      <chr>       <fct>   <fct>                                    <dbl>            <dbl>
:)    1 13063040202 Clayton Median annual household income ($)    33276                4.06
:)    2 13063040202 Clayton Percent residents under poverty (log)    -1.55             4.06
:)    3 13063040202 Clayton Percent white resident                    7.51             4.06
:)    4 13063040202 Clayton Total population                       2850                4.06
:)    5 13063040308 Clayton Median annual household income ($)    28422                1.98
:)    6 13063040308 Clayton Percent residents under poverty (log)    -1.51             1.98
:)    7 13063040308 Clayton Percent white resident                   26.1              1.98
:)    8 13063040308 Clayton Total population                       4262                1.98
:)    9 13063040407 Clayton Median annual household income ($)    49271                3.32
:)   10 13063040407 Clayton Percent residents under poverty (log)    -2.13             3.32
:)   # ℹ 1,442 more rows
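The row count of the long table is a useful check on the reshape: pivoting four columns should give four rows per tract, and 363 tracts × 4 = 1,452 matches the tibble above. The sketch below shows the same relationship on a made-up three-row data frame (`toy`, with invented values).

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for coffee_dtyped; the values are invented for illustration.
toy <- tibble(
  GEOID            = c("A", "B", "C"),
  hhincome         = c(33276, 28422, 49271),
  pct_pov_log      = c(-1.55, -1.51, -2.13),
  pct_white        = c(7.51, 26.07, 20.51),
  pop              = c(2850L, 4262L, 4046L),
  review_count_log = c(4.06, 1.98, 3.32)
)

toy_long <- toy %>%
  pivot_longer(cols = c("hhincome", "pct_pov_log", "pct_white", "pop"),
               names_to = "exp_variables",
               values_to = "values")

# Each input row becomes one row per pivoted column.
nrow(toy_long) == nrow(toy) * 4
```

Columns that are not pivoted (here GEOID and review_count_log) are repeated on each of the four rows of a tract.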

In Plot 4, I first notice that review counts have little to do with the total population of each census tract, because large, highly attractive POIs draw visitors not only from within the community but from across the metropolitan area. Importantly, the three proxies for the socioeconomic characteristics of communities exhibit consistent trends with regard to review counts. Communities with higher income levels, lower percentages of residents under poverty, and higher percentages of white residents tend to have POIs with greater average review counts, presumably the more popular and attractive ones. The simple regression lines provide insight into the linear relationship between each proxy and review counts within each county. While the correlation coefficient is measured on the entire sample across the five counties, the steepest regression line, in DeKalb County, implies that this county has the strongest asymmetries in POI distribution across socioeconomic status.

coffee_dtyped_longed %>%
  mutate(county=as.character(county)) %>%
  ggplot(aes(x=review_count_log, y=values, col=county)) +
  geom_point(size=1) +
  geom_smooth(method="lm", se=FALSE, fullrange=FALSE) +
  ggpubr::stat_cor(aes(x=review_count_log, y=values),
                   method="pearson",
                   p.accuracy = 0.001, r.accuracy = 0.01,
                   inherit.aes=FALSE) +  # drops the county grouping: one pooled coefficient per panel
  facet_wrap(~exp_variables, scales="free_y") +
  labs(x="Review count (log)", y="Values (scale differs for each y-variable)", color="County") +
  ggtitle("Scatterplot: Review count (log) vs. Neighborhood characteristics",
          subtitle="Using Yelp data in five counties around Atlanta, GA") +
  theme_bw()
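Because the plotted correlation is pooled across counties while the regression lines are fitted per county, it can help to compute both the slope and the Pearson coefficient county by county. This is a minimal sketch on a made-up data frame (`toy`, with invented values for two counties), using income as the only outcome.

```r
library(dplyr)

# Hypothetical stand-in for one panel of the plot; the values are invented.
toy <- tibble(
  county           = rep(c("DeKalb", "Fulton"), each = 4),
  review_count_log = c(1, 2, 3, 4, 1, 2, 3, 4),
  hhincome         = c(30000, 50000, 70000, 90000, 50000, 55000, 65000, 70000)
)

# Per-county Pearson correlation and simple-regression slope.
by_county <- toy %>%
  group_by(county) %>%
  summarise(
    n     = n(),
    r     = cor(review_count_log, hhincome),
    slope = coef(lm(hhincome ~ review_count_log))[["review_count_log"]],
    .groups = "drop"
  )

by_county
```

The slope and the correlation answer different questions: the slope is in the units of the outcome and says how much income changes per unit of logged review count, while r says how tightly the points follow the line. A county can have the steepest line without having the strongest correlation, so reading the two side by side gives a fuller picture.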