mini4

load libraries and data

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.1
## ✔ readr   2.1.2     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(sf)

## Linking to GEOS 3.9.1, GDAL 3.4.3, PROJ 7.2.1; sf_use_s2() is TRUE

library(tmap)
library(leaflet)
library(tidycensus)
library(readr)
library(plotly)

## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

library(ggdark)
library(ggplot2)
library(ggpubr)

data <- read_csv("coffee.csv")

## New names:
## Rows: 363 Columns: 14
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): county dbl (13): ...1, GEOID, hhincome, pct_pov, review_count, avg_rating,
## race.tot...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`

head(data)

## # A tibble: 6 × 14
##    ...1     GEOID county hhinc…¹ pct_pov revie…² avg_r…³ race.…⁴ avg_p…⁵ pct_w…⁶
##   <dbl>     <dbl> <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1     1   1.31e10 Clayt…   33276  0.201     57         2    2850       1  0.0751
## 2     2   1.31e10 Clayt…   28422  0.211     13         3    4262       1  0.261 
## 3     3   1.31e10 Clayt…   49271  0.108     29.3       2    4046       1  0.205 
## 4     4   1.31e10 Clayt…   44551  0.181     20         4    8489       1  0.169 
## 5     5   1.31e10 Clayt…   49719  0.115     41         1    7166       1  0.194 
## 6     6   1.31e10 Clayt…   57924  0.0907    18         2   13311       1  0.165 
## # … with 4 more variables: hhincome_log <dbl>, review_count_log <dbl>,
## #   pct_pov_log <dbl>, yelp_n <dbl>, and abbreviated variable names ¹hhincome,
## #   ²review_count, ³avg_rating, ⁴race.tot, ⁵avg_price, ⁶pct_white

Plot 1

ggplot(data, aes(factor(avg_rating),hhincome))+geom_boxplot()+
  xlab("Avg Rating")+ylab("Household Income")+
   geom_boxplot(fill='#FFA4B6', color="purple")

The average rating some what varies depending on household income. It seems like the median average income is below 100,000 for each rating, indicating that it could be interesting to further explore why higher household income individuals are not as likely to rate coffee shops. The median household income is lowest for a 1 star rating and 5 star rating. This might be because coffee is expensive, and so higher household income individuals think a little less about how much they are spending and are less critical of coffee shops in their area. The distribution is almost normal for the 3 rating but left skwed for all other ratings. The mean of negatively skewed data is is less than the median, so the average household income per rating is a little lower than the medians shown.

Plot 2

ggplot(data, aes(factor(avg_rating),hhincome))+geom_boxplot()+
  xlab("Average Yelp Rating")+ylab("Median Household Income ($)")+
   geom_boxplot(fill='#FFA4B6', color="purple")+
    facet_wrap(~county)

Clayton county has very low median household income regardless of the rating. I searched the overall median household income for the county and it is around 50,000 so maybe these results are due to the overall breakdown of the county.The graph may look more accurate with a different y axis. Additionally, Dekalb county has the most amount of outliars, recreating the graph after removing them may lead to more accurate results. Fulton county has the most spread in it’s data, which could mean that house hold income varies in this county more than other counties. In the previous plot the median household incomes were lowest for ratings one and five, when breaking the data down into counties it can be seen that this still holds true.

Plot 3

ggplot(data = data) +
  geom_point(mapping = aes(x=review_count_log, y=hhincome, color = pct_white),alpha = 0.6) +
  labs(x = "Review Count (log)",
       y = "Median Annual Household Income",
       color = "Proportion of Residents \n who self-identified\n as white",
       title = "Scatterplot: Review Count vs. Household Income") +
      facet_wrap(~county)+
  theme_light()+
  scale_color_gradient(low="darkblue", high="red")

This plot visualizes the fact that tracts with more people self-identifying as white have higher house hold incomes, as can be seen when comparing Clayton County to any other county. The spread is similar in each graph, again except Clayton. This could indicate that people in low median annual household income areas are less likely to leave reviews. This might also be because of lack of access. It would be interesting to see if distribution of coffee shops have anything to do with annual household income.

Plot 4

#finding r sq for each plot

rhhincome<-cor(data$review_count_log,data$hhincome)
rpov<-cor(data$review_count_log,data$pct_pov_log)
rwhite<-cor(data$review_count_log,data$pct_white)
rtotalpop<-cor(data$review_count_log,data$race.tot)

rhhincome

## [1] 0.1332059

rpov

## [1] -0.1878719

rwhite

## [1] 0.2776534

rtotalpop

## [1] -0.01707485

#find pvalue for each plot

regression= lm(review_count_log~hhincome,data)
summary(regression)

## 
## Call:
## lm(formula = review_count_log ~ hhincome, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6426 -0.5533  0.0529  0.6004  3.6950 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.144e+00  1.256e-01  25.023   <2e-16 ***
## hhincome    3.719e-06  1.456e-06   2.554   0.0111 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.004 on 361 degrees of freedom
## Multiple R-squared:  0.01774,    Adjusted R-squared:  0.01502 
## F-statistic: 6.521 on 1 and 361 DF,  p-value: 0.01107

regression2= lm(review_count_log~pct_pov_log,data)
summary(regression2)

## 
## Call:
## lm(formula = review_count_log ~ pct_pov_log, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6207 -0.5490  0.0403  0.5993  3.6177 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.85709    0.16733  17.074  < 2e-16 ***
## pct_pov_log -0.26365    0.07255  -3.634 0.000319 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.995 on 361 degrees of freedom
## Multiple R-squared:  0.0353, Adjusted R-squared:  0.03262 
## F-statistic: 13.21 on 1 and 361 DF,  p-value: 0.0003192

regression3= lm(review_count_log~pct_white,data)
summary(regression3)

## 
## Call:
## lm(formula = review_count_log ~ pct_white, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7484 -0.5491  0.0531  0.6274  3.5923 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.9113     0.1082  26.916  < 2e-16 ***
## pct_white     1.0655     0.1940   5.491 7.54e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9732 on 361 degrees of freedom
## Multiple R-squared:  0.07709,    Adjusted R-squared:  0.07453 
## F-statistic: 30.15 on 1 and 361 DF,  p-value: 7.537e-08

regression4= lm(review_count_log~race.tot,data)
summary(regression4)

## 
## Call:
## lm(formula = review_count_log ~ race.tot, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7670 -0.5257  0.0425  0.5489  3.7358 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.468e+00  1.150e-01  30.156   <2e-16 ***
## race.tot    -5.199e-06  1.602e-05  -0.324    0.746    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.013 on 361 degrees of freedom
## Multiple R-squared:  0.0002916,  Adjusted R-squared:  -0.002478 
## F-statistic: 0.1053 on 1 and 361 DF,  p-value: 0.7458

#create columns to facet wrap with names 
pivot_data <- data %>%pivot_longer(cols=c('pct_pov_log','hhincome','pct_white','race.tot'),
                               names_to = 'names',
                               values_to ='values')

#get correct graph titles
pivot_data$names[pivot_data$names=="pct_pov_log"]<-"Percent Residents Under Poverty"
pivot_data$names[pivot_data$names=="hhincome"]<-"Median Annual Household Income ($)"
pivot_data$names[pivot_data$names=="pct_white"]<-"Percent White Resident"
pivot_data$names[pivot_data$names=="race.tot"]<-"Total Population"


#Plot
ggplot(data = pivot_data, mapping = aes(x=review_count_log, y=values, color = county)) +
  facet_wrap(~names,scales ='free')+
  geom_point() + 
  geom_smooth(mapping = aes(color = county), 
             method = "lm", se=F)+
     labs(title = "Scatterplot between logged review count & neighborhood characteristics",
              subtitle = "Using Yelp data in Five Counties Around Atlanta, GA",
           color = "County") +
  ylab('Values')+xlab("Review Count Logged")

## `geom_smooth()` using formula 'y ~ x'

The R values indicate a weak correlation for all of the graphs which could mean that none of the y-variables are correlated with the review count. However it is also possible that this is due to outliars.The data could be more accurate if they were removed. It is also possible that the R values per county are very different and that in certain counties there is a correlation while in others there is not. For example, I calculated the correlation between house hold income and review count below for Dekalb County and it is .42, which is a moderate correlation. This is much stronger than the 0.13 R value that is assigned to the data as a whole. I think remaking this graph and calculating the p-value and r-value per county would provide better results.

rdekalb<- data %>% filter(county == "DeKalb County")
cor(rdekalb$review_count_log,rdekalb$hhincome)

## [1] 0.4259369