Prepare The Data

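# Packages used below: ggplot2 (plots), tidyr (pivot_longer), dplyr (%>%)
library(ggplot2)
library(tidyr)
library(dplyr)

setwd("C:/Users/49765/Desktop/Urban Analytics/mini4")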

# Read CSV file
data <- read.csv("coffee.csv")

head(data)
##   X       GEOID         county hhincome    pct_pov review_count avg_rating
## 1 1 13063040202 Clayton County    33276 0.20134228     57.00000          2
## 2 2 13063040308 Clayton County    28422 0.21071800     13.00000          3
## 3 3 13063040407 Clayton County    49271 0.10825507     29.33333          2
## 4 4 13063040408 Clayton County    44551 0.18095661     20.00000          4
## 5 5 13063040410 Clayton County    49719 0.11468019     41.00000          1
## 6 6 13063040411 Clayton County    57924 0.09068942     18.00000          2
##   race.tot avg_price  pct_white hhincome_log review_count_log pct_pov_log
## 1     2850         1 0.07508772     10.41289         4.060443   -1.554276
## 2     4262         1 0.26067574     10.25527         1.975622   -1.510869
## 3     4046         1 0.20514088     10.80529         3.320837   -2.134911
## 4     8489         1 0.16868889     10.70461         3.044522   -1.655709
## 5     7166         1 0.19369244     10.81434         3.737670   -2.082003
## 6    13311         1 0.16512659     10.96706         2.944439   -2.295715
##   yelp_n
## 1      1
## 2      2
## 3      3
## 4      1
## 5      1
## 6      1

Visualization

Household income rises with average rating, except at a rating of 5. Assuming that higher-rated cafes usually offer higher-quality products, the chart suggests that such cafes tend to be located in areas with higher household incomes, which in turn suggests that higher-quality products tend to command a higher price. The exception is the top of the scale: cafes rated 5 sit in areas with lower household incomes than the trend would predict, suggesting that the highest-quality products are good value for money.

ggplot(data, aes(factor(avg_rating), hhincome)) +
  geom_boxplot() +
  xlab("Average Rating") +
  ylab("Household Income") +
  ggtitle("Boxplot of Income vs. Average Rating")
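The pattern read off the boxplot can also be checked numerically by computing the median household income at each rating level. The sketch below is only an illustration: `sample_rows` is a hypothetical name holding the six rows printed by head(data) above, so its medians say nothing about the full data set. Passing the full `data` frame to the same aggregate() call should give the medians drawn in the boxplot.

```r
# Illustration only: the six rows shown by head(data), not the full coffee.csv.
# `sample_rows` is a hypothetical name introduced for this sketch.
sample_rows <- data.frame(
  avg_rating = c(2, 3, 2, 4, 1, 2),
  hhincome   = c(33276, 28422, 49271, 44551, 49719, 57924)
)

# Median household income for each rating level (one row per rating).
median_by_rating <- aggregate(hhincome ~ avg_rating, data = sample_rows, FUN = median)
print(median_by_rating)
```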

The charts show the relationship between average rating and household income in each county. In Clayton County, almost all rated cafes are in locations with low household incomes, probably because household incomes are low across the county and there are few cafes. In Cobb County there are no cafes with a rating of 1, and household incomes are more consistent across ratings. In DeKalb County, even cafes with a rating of 1 are in higher-income locations, and more cafes are in very high-income locations. Fulton County and Gwinnett County are more consistent with the region as a whole.

ggplot(data = data, aes(x = factor(avg_rating), y = hhincome)) +
  geom_boxplot(aes(fill = factor(avg_rating)), show.legend = FALSE) +
  labs(
    #title = "Boxplot of Avg Rating by Household Income",
    x = "Average Rating",
    y = "Household Income"
  ) +
  theme_minimal() +
  scale_fill_manual(values = rep("white", 5)) +  # set every box fill to white
  facet_wrap(~ county, ncol = 3) +
  theme(strip.background = element_rect(fill = "lightgrey"))

The position of each dot reflects the relationship between review count (log) and household income, and its color reflects the corresponding percentage of white residents. The number of dots in a panel reflects the number of cafe locations in that county. One obvious feature is that locations with a higher share of white residents usually have higher household incomes.

ggplot(data = data, aes(x = review_count_log, y = hhincome, color = pct_white)) +
  geom_point() +
  labs(
    x = "Review Count(Log)",
    y = "Household Income",
    color = "P(white)"
  ) +
  theme_minimal() +
  facet_wrap(~ county, ncol = 3) +
  theme(strip.background = element_rect(fill = "lightgrey"))
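The link between the share of white residents and household income could be quantified with a correlation coefficient. This is a minimal sketch under a strong assumption: `sample_rows` is a hypothetical name for the six Clayton County rows printed by head(data), far too few to confirm or contradict the pattern in the scatter plot. Running the same cor() call on the full `data` frame would be the meaningful check.

```r
# Illustration only: the six rows shown by head(data), all from Clayton County.
# `sample_rows` is a hypothetical name introduced for this sketch.
sample_rows <- data.frame(
  pct_white = c(0.07508772, 0.26067574, 0.20514088, 0.16868889, 0.19369244, 0.16512659),
  hhincome  = c(33276, 28422, 49271, 44551, 49719, 57924)
)

# Pearson correlation between share of white residents and household income.
r_white_income <- cor(sample_rows$pct_white, sample_rows$hhincome)
print(r_white_income)
```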

In this figure, the color of the dots indicates the county. The correlations with log review count are weak in all four panels, with only the percentage of white residents showing a slightly stronger relationship.

# Use pivot_longer to reshape several variables into long format
df_long <- data %>%
  pivot_longer(cols = c("hhincome", "pct_pov_log", "pct_white", "race.tot"),
               names_to = "Variable", values_to = "Value")

# Scatter plot colored by county, with one facet_wrap panel per variable
ggplot(df_long, aes(x = review_count_log, y = Value, color = county)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) + # add a fitted line
  facet_wrap(~ Variable, scales = "free") +
  labs(
    x = "Review Count(Log)",
    y = "Values",
    title = "The relationships between different values and review_count_log"
  )
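The strength of the four relationships in the panels can be summarized by computing the correlation of each variable with review_count_log. The sketch below is illustrative rather than a reproduction of the analysis: `sample_rows`, `vars`, and `cors` are hypothetical names, and the values are only the six rows printed by head(data). Applying the same sapply() call to the full `data` frame would give the correlations behind the fitted lines, pooled over counties.

```r
# Illustration only: the six rows shown by head(data), not the full coffee.csv.
# `sample_rows`, `vars`, and `cors` are hypothetical names introduced for this sketch.
sample_rows <- data.frame(
  review_count_log = c(4.060443, 1.975622, 3.320837, 3.044522, 3.737670, 2.944439),
  hhincome         = c(33276, 28422, 49271, 44551, 49719, 57924),
  pct_pov_log      = c(-1.554276, -1.510869, -2.134911, -1.655709, -2.082003, -2.295715),
  pct_white        = c(0.07508772, 0.26067574, 0.20514088, 0.16868889, 0.19369244, 0.16512659),
  race.tot         = c(2850, 4262, 4046, 8489, 7166, 13311)
)

# The four variables that were pivoted into long format for the plot.
vars <- c("hhincome", "pct_pov_log", "pct_white", "race.tot")

# Pearson correlation of each variable with log review count.
cors <- sapply(vars, function(v) cor(sample_rows$review_count_log, sample_rows[[v]]))
print(round(cors, 2))
```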