UA Mini 4 Visualization for understanding the distribution of POIs across neighborhoods
R Setup
1 Coffee data
The following is an integrated dataset between ACS and Yelp, showing the sociodemographic characteristics of census tracts and the distribution of coffee shops in each census tract.
:) Reading layer `coffee' from data source `/Users/jaeglee/Library/CloudStorage/OneDrive-GeorgiaInstituteofTechnology/2024 Fall/UA/UA_R/assignments/data/coffee.csv' using driver `CSV'
:) field_1 GEOID county hhincome pct_pov review_count avg_rating pop avg_price pct_white hhincome_log review_count_log pct_pov_log yelp_n
:) 1 1 13063040202 Clayton County 33276 0.201342282 57 2 2850 1 0.075087719 10.41289217 4.060443011 -1.554276271 1
:) 2 2 13063040308 Clayton County 28422 0.210718002 13 3 4262 1 0.260675739 10.25527055 1.975621859 -1.510869401 2
:) 3 3 13063040407 Clayton County 49271 0.108255067 29.33333333 2 4046 1 0.20514088 10.80529389 3.320836964 -2.134911405 3
:) 4 4 13063040408 Clayton County 44551 0.180956612 20 4 8489 1 0.168688892 10.70461432 3.044522438 -1.65570904 1
:) 5 5 13063040410 Clayton County 49719 0.114680192 41 1 7166 1 0.193692437 10.81434354 3.737669618 -2.082003287 1
:) [ reached 'max' / getOption("max.print") -- omitted 358 rows ]
# Drop the row index, shorten county names, and give every column a proper type
coffee_dtyped <- coffee %>%
  select(-field_1) %>%
  mutate(county = factor(str_split_i(county, " ", 1)),  # "Clayton County" -> "Clayton"
         hhincome = as.numeric(hhincome),
         pct_pov = as.numeric(pct_pov)*100,  # proportion -> percent
         avg_rating = factor(avg_rating, levels=c("1", "2", "3", "4", "5")),
         avg_price = as.numeric(avg_price),
         pct_white = as.numeric(pct_white)*100,  # proportion -> percent
         hhincome_log = as.numeric(hhincome_log),
         review_count_log = as.numeric(review_count_log),
         pct_pov_log = as.numeric(pct_pov_log),
         review_count = as.integer(review_count),
         pop = as.integer(pop),
         yelp_n = as.integer(yelp_n))
coffee_dtyped
:) GEOID county hhincome pct_pov review_count avg_rating pop avg_price pct_white hhincome_log review_count_log pct_pov_log yelp_n
:) 1 13063040202 Clayton 33276 20.1 57 2 2850 1 7.51 10.4 4.06 -1.55 1
:) 2 13063040308 Clayton 28422 21.1 13 3 4262 1 26.07 10.3 1.98 -1.51 2
:) 3 13063040407 Clayton 49271 10.8 29 2 4046 1 20.51 10.8 3.32 -2.13 3
:) 4 13063040408 Clayton 44551 18.1 20 4 8489 1 16.87 10.7 3.04 -1.66 1
:) 5 13063040410 Clayton 49719 11.5 41 1 7166 1 19.37 10.8 3.74 -2.08 1
:) [ reached 'max' / getOption("max.print") -- omitted 358 rows ]
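The type conversion above can be tried on a small scale. The sketch below rebuilds two raw rows from the first printed output and applies the same kind of `mutate()` call to a subset of columns. It assumes every column arrives as character, which would explain the `as.numeric()` calls but is not confirmed by the output; the object names `raw` and `typed` are made up for illustration, and `str_split_i()` requires stringr 1.5.0 or later.

```r
library(dplyr)
library(stringr)

# Two rows copied from the raw output above, stored as character
# (assumption: this mirrors how the CSV was read in).
raw <- data.frame(
  GEOID      = c("13063040202", "13063040308"),
  county     = c("Clayton County", "Clayton County"),
  hhincome   = c("33276", "28422"),
  pct_pov    = c("0.201342282", "0.210718002"),
  avg_rating = c("2", "3"),
  stringsAsFactors = FALSE
)

typed <- raw %>%
  mutate(
    county     = factor(str_split_i(county, " ", 1)),  # keep the first word only
    hhincome   = as.numeric(hhincome),
    pct_pov    = as.numeric(pct_pov) * 100,            # proportion -> percent
    avg_rating = factor(avg_rating, levels = c("1", "2", "3", "4", "5"))
  )

str(typed)
```

Fixing the factor levels to "1" through "5" keeps every rating on the x-axis in the same order, even when a rating is missing from a subset of the data.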
2 Plot 1
In Plot 1, I examined the association between the average ratings of POIs in each census tract and the median household incomes of census tracts. In general, census tracts with higher-rated POIs tend to have higher median incomes, implying that higher-quality services are provided for those who can afford higher prices. Interestingly, census tracts with the highest rating (5) are largely located in lower-income neighborhoods. This might be because attractive POIs are often in downtown areas, which many high-income patrons visit but where the resident households often have lower incomes.
# Distribution of tract-level median income for each average rating
coffee_dtyped %>%
  ggplot(aes(x=avg_rating, y=hhincome)) +
  geom_boxplot() +
  labs(x="Average Yelp rating", y="Median annual household income ($)") +
  theme_bw()
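The medians and spreads that a boxplot draws can also be tabulated, which makes the comparison across ratings easier to check. The sketch below uses only the five tracts visible in the printed output, so the numbers are illustrative and say nothing about the full sample. The names `toy` and `income_by_rating` are hypothetical; the same pipeline should apply to `coffee_dtyped`.

```r
library(dplyr)

# First five tracts from the printed coffee_dtyped output (illustration only).
toy <- data.frame(
  hhincome   = c(33276, 28422, 49271, 44551, 49719),
  avg_rating = factor(c(2, 3, 2, 4, 1), levels = 1:5)
)

# Number of tracts, median income, and interquartile range per rating.
income_by_rating <- toy %>%
  group_by(avg_rating) %>%
  summarise(
    n_tracts      = n(),
    median_income = median(hhincome),
    iqr_income    = IQR(hhincome)
  )

print(income_by_rating)
```

Reporting `n_tracts` next to each median shows how many tracts each box is based on.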
3 Plot 2
In Plot 2, I expanded on Plot 1 by splitting the association by county. The general trend seen in Plot 1 holds in DeKalb, Fulton, and Gwinnett counties. Clayton and Cobb counties probably have too few census tracts to draw conclusions about the association between average POI ratings and median household incomes.
# Same boxplot as Plot 1, drawn separately for each county
coffee_dtyped %>%
  ggplot(aes(x=avg_rating, y=hhincome)) +
  geom_boxplot() +
  facet_wrap(~ county) +
  labs(x="Average Yelp rating", y="Median annual household income ($)") +
  theme_bw()
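Whether a county has too few tracts per rating can be checked by counting tracts in each county-by-rating cell. The sketch below does this for the five Clayton tracts shown in the printed output. The cutoff of five tracts per cell is an arbitrary assumption for illustration, and `toy`, `cell_counts`, and `thin_cells` are hypothetical names.

```r
library(dplyr)

# County and rating of the five tracts visible in the printed output.
toy <- data.frame(
  county     = factor(rep("Clayton", 5)),
  avg_rating = factor(c(2, 3, 2, 4, 1), levels = 1:5)
)

# .drop = FALSE keeps empty cells, so a rating with no tracts shows up as 0.
cell_counts <- toy %>%
  count(county, avg_rating, .drop = FALSE, name = "n_tracts")

# Flag cells that are too thin to support a boxplot (assumed cutoff: 5 tracts).
thin_cells <- cell_counts %>%
  filter(n_tracts < 5)

print(cell_counts)
```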
4 Plot 3
I referred to this to find a wide range of palettes in R. In ggplot, there are many different ways to use palettes, and the scale functions follow a common naming pattern:
scale_ + color(or fill) + manual(or viridis, brewer, gradient)
There are also pre-designed palettes for well-regarded journals (Nature, etc.), and they are easily implemented by using functions like scale_color_npg() in the ggsci package.
library(viridis)
# Other palette functions can also be used, e.g.:
# viridis(), magma(), plasma(), inferno(), cividis()
scales::show_col(viridis(8))
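The naming pattern described above can be illustrated with a few concrete scale functions. The sketch below uses the built-in `mtcars` dataset as a stand-in, since it is not part of this assignment, and the specific palette choices ("magma", "Set2", the three hex colors) are arbitrary assumptions.

```r
library(ggplot2)

base_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) + theme_bw()

# Continuous variable mapped to color: viridis and two-color gradient scales.
p_viridis <- base_plot +
  geom_point(aes(color = hp)) +
  scale_color_viridis_c(option = "magma")

p_gradient <- base_plot +
  geom_point(aes(color = hp)) +
  scale_color_gradient(low = "blue", high = "red")

# Discrete variable mapped to color: ColorBrewer palette.
p_brewer <- base_plot +
  geom_point(aes(color = factor(cyl))) +
  scale_color_brewer(palette = "Set2")

# Discrete variable mapped to fill: hand-picked colors, one per level.
p_manual <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(cyl))) +
  geom_bar() +
  scale_fill_manual(values = c("4" = "#440154", "6" = "#21908C", "8" = "#FDE725")) +
  theme_bw()
```

The middle part of the name has to match the aesthetic being mapped: `scale_fill_*` has no effect on points colored through `color`, and vice versa.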
In Plot 3, I notice a slightly positive association between the average number of reviews of POIs (logged) and the median household income in each census tract, especially in Cobb, DeKalb, and Fulton counties. Race and ethnicity also appear to play a role. In DeKalb and Fulton counties, communities where whites are the majority tend to have higher median incomes and POIs with more reviews. This suggests not only economic but also racial inequity in access to attractive, popular POIs in our study area.
# Review count vs. income by county, with points colored by percent white
coffee_dtyped %>%
  ggplot(aes(x=review_count_log, y=hhincome, col=pct_white)) +
  geom_point(alpha=0.5, size=2) +
  facet_wrap(~ county) +
  scale_colour_gradient(low = "blue", high = "red") + # for a diverging palette with white in the middle, use scale_color_gradient2()
  labs(x="Review count (log)", y="Median annual household income ($)", color="Percentage of residents\n who self-identified as white (%)") +
  ggtitle("Scatterplot: Review count vs. Household income") +
  theme_bw()
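The comment in the code above mentions `scale_color_gradient2()` as the diverging alternative. A minimal sketch of that variant follows, using the five tracts from the printed output. Setting the midpoint at 50% is an assumption, chosen so that white marks the boundary between majority-white and majority-non-white tracts; `toy` and `p_diverging` are hypothetical names.

```r
library(ggplot2)

# Five tracts from the printed coffee_dtyped output (illustration only).
toy <- data.frame(
  review_count_log = c(4.06, 1.98, 3.32, 3.04, 3.74),
  hhincome         = c(33276, 28422, 49271, 44551, 49719),
  pct_white        = c(7.51, 26.07, 20.51, 16.87, 19.37)
)

# Diverging scale: blue below the midpoint, white at it, red above it.
p_diverging <- ggplot(toy, aes(x = review_count_log, y = hhincome, color = pct_white)) +
  geom_point(size = 2) +
  scale_color_gradient2(low = "blue", mid = "white", high = "red",
                        midpoint = 50, limits = c(0, 100)) +
  labs(x = "Review count (log)",
       y = "Median annual household income ($)",
       color = "Percent white (%)") +
  theme_bw()
```

Fixing `limits` to 0 through 100 keeps the color mapping comparable across facets or subsets that cover different ranges of `pct_white`.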
5 Plot 4
I transformed the dataset into long format before the visualization in order to leverage the useful facet_wrap() function in ggplot. Four columns, namely “hhincome”, “pct_pov_log”, “pct_white”, and “pop”, were transformed.
levels = c("hhincome", "pct_pov_log", "pct_white", "pop")
labels = c("Median annual household income ($)", "Percent residents under poverty (log)", "Percent white resident", "Total population")
# Stack the four explanatory variables into one column and relabel them for the facets
coffee_dtyped_longed <- coffee_dtyped %>%
  pivot_longer(cols=c("hhincome", "pct_pov_log", "pct_white", "pop"),
               names_to="exp_variables",
               values_to="values") %>%
  select(GEOID, county, exp_variables, values, review_count_log) %>%
  mutate(exp_variables = factor(exp_variables,
                                levels=levels,
                                labels=labels))
coffee_dtyped_longed
:) # A tibble: 1,452 × 5
:) GEOID county exp_variables values review_count_log
:) <chr> <fct> <fct> <dbl> <dbl>
:) 1 13063040202 Clayton Median annual household income ($) 33276 4.06
:) 2 13063040202 Clayton Percent residents under poverty (log) -1.55 4.06
:) 3 13063040202 Clayton Percent white resident 7.51 4.06
:) 4 13063040202 Clayton Total population 2850 4.06
:) 5 13063040308 Clayton Median annual household income ($) 28422 1.98
:) 6 13063040308 Clayton Percent residents under poverty (log) -1.51 1.98
:) 7 13063040308 Clayton Percent white resident 26.1 1.98
:) 8 13063040308 Clayton Total population 4262 1.98
:) 9 13063040407 Clayton Median annual household income ($) 49271 3.32
:) 10 13063040407 Clayton Percent residents under poverty (log) -2.13 3.32
:) # ℹ 1,442 more rows
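The row count of the long table follows directly from the reshape: each tract contributes one row per pivoted column. The printed wide table shows 5 rows plus 358 omitted, so 363 tracts times 4 variables gives the 1,452 rows reported above. The sketch below checks the same arithmetic on the five visible tracts; `toy` and `toy_long` are hypothetical names.

```r
library(dplyr)
library(tidyr)

# Five tracts from the printed coffee_dtyped output (illustration only).
toy <- data.frame(
  GEOID            = c("13063040202", "13063040308", "13063040407",
                       "13063040408", "13063040410"),
  hhincome         = c(33276, 28422, 49271, 44551, 49719),
  pct_pov_log      = c(-1.55, -1.51, -2.13, -1.66, -2.08),
  pct_white        = c(7.51, 26.07, 20.51, 16.87, 19.37),
  pop              = c(2850, 4262, 4046, 8489, 7166),
  review_count_log = c(4.06, 1.98, 3.32, 3.04, 3.74)
)

# Four columns are stacked, so every tract becomes four rows.
toy_long <- toy %>%
  pivot_longer(cols = c(hhincome, pct_pov_log, pct_white, pop),
               names_to = "exp_variables",
               values_to = "values")

nrow(toy_long)  # 5 tracts * 4 variables
363 * 4         # same arithmetic for the full dataset
```

Columns that are not pivoted, here `GEOID` and `review_count_log`, are repeated on each of the four rows, which is why the review count can serve as the shared x-axis in every facet.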
In Plot 4, I first notice that review counts have little to do with the total population of each census tract, because large and highly attractive POIs draw visitors not only from the surrounding community but from across the metropolitan area. Importantly, the three proxies for the socioeconomic characteristics of communities exhibit consistent trends with regard to review counts. Communities with higher income levels, lower percentages of residents under poverty, and higher percentages of white residents tend to have POIs with greater average review counts, presumably the more popular and attractive ones. The simple regression lines provide insight into the linear relationship between each proxy and review counts within each county. While the correlation coefficient is computed on the entire sample across the five counties, the steepest regression line, in DeKalb County, implies that this county has the strongest asymmetries in POI distribution across socioeconomic status.
coffee_dtyped_longed %>%
  mutate(county=as.character(county)) %>%
  ggplot(aes(x=review_count_log, y=values, col=county)) +
  geom_point(size=1) +
  # One least-squares line per county, without confidence bands
  geom_smooth(method="lm", se=FALSE, fullrange=FALSE) +
  # Pearson correlation on all counties pooled, one value per facet
  ggpubr::stat_cor(aes(x=review_count_log, y=values),
                   method="pearson",
                   p.accuracy = 0.001, r.accuracy = 0.01,
                   inherit.aes=FALSE) +
  facet_wrap(~exp_variables, scales="free_y") +
  labs(x="Review count (log)", y="Values (differing scales for each y-variable)", color="County") +
  ggtitle("Scatterplot: Review count (log) vs. Neighborhood characteristics",
          subtitle="Using Yelp data in five counties around Atlanta, GA") +
  theme_bw()
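Because the correlation printed in each facet pools all five counties, comparing counties by eye relies on the slopes of the regression lines. One way to put numbers on that comparison is to compute the correlation and slope per county. The sketch below does so for household income only, on the five Clayton tracts from the printed output, so the values are illustrative. `toy` and `cor_by_county` are hypothetical names, and applying the same idea to `coffee_dtyped_longed` would mean grouping by both `county` and `exp_variables`.

```r
library(dplyr)

# Five Clayton tracts from the printed output (illustration only).
toy <- data.frame(
  county           = rep("Clayton", 5),
  review_count_log = c(4.06, 1.98, 3.32, 3.04, 3.74),
  hhincome         = c(33276, 28422, 49271, 44551, 49719)
)

# Per-county Pearson r and least-squares slope, matching the plot's axes
# (review_count_log on x, the neighborhood characteristic on y).
cor_by_county <- toy %>%
  group_by(county) %>%
  summarise(
    n_tracts  = n(),
    pearson_r = cor(review_count_log, hhincome, method = "pearson"),
    slope     = coef(lm(hhincome ~ review_count_log))[["review_count_log"]]
  )

print(cor_by_county)
```

Slopes depend on the units of the y-variable, so they are comparable across counties within one facet but not across facets; the correlation coefficient is unit-free and can be compared in both directions.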