Box Plot 1
library(ggplot2) # plotting
library(dplyr)   # %>% and select()
library(tidyr)   # pivot_longer()
coffee <- read.csv('coffee.csv')
bxplot <- ggplot(data = coffee) +
geom_boxplot(aes(x = as.character(avg_rating), y = hhincome),
color = 'black', fill ='white') +
labs(title = 'household income and average rating box plot', x = 'average rating', y = 'household income')
plotly::ggplotly(bxplot)
# I had to convert the x-axis variable to character so that ggplot2 would treat it as discrete rather than continuous.
Based on the box plots, household income loosely tracks average rating, rising with it up to a rating of four. However, there is a sharp drop-off at an average rating of five. The rating-2 group has a notable number of outliers, and the rating-4 group covers a very wide income range.
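To back up the visual read with numbers, a small helper like the one below could tabulate the group size, median, IQR, and outlier count for each rating. This is only a sketch: `rating_summary` is a hypothetical helper name, it assumes `coffee` has the `avg_rating` and `hhincome` columns used above, and `boxplot.stats()` applies the 1.5 x IQR rule using hinges, so its outlier counts may differ slightly from the points `geom_boxplot()` draws.

```r
# Hypothetical helper: per-rating summary of household income.
rating_summary <- function(df) {
  groups <- split(df$hhincome, df$avg_rating)
  data.frame(
    avg_rating = names(groups),
    n          = vapply(groups, function(x) sum(!is.na(x)), integer(1)),
    median     = vapply(groups, function(x) as.numeric(median(x, na.rm = TRUE)), numeric(1)),
    iqr        = vapply(groups, function(x) as.numeric(IQR(x, na.rm = TRUE)), numeric(1)),
    # Points beyond 1.5 x IQR from the hinges (base R's box plot rule)
    n_outliers = vapply(groups, function(x) length(boxplot.stats(x)$out), integer(1)),
    row.names  = NULL
  )
}

# Only runs when the coffee data frame from above is loaded.
if (exists("coffee")) print(rating_summary(coffee))
```

A large `n_outliers` for rating 2 or a large `iqr` for rating 4 would support the observations above.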
Box Plot 2
plot2 <- coffee %>%
select(c(county, hhincome, avg_rating)) # I like to make a separate df for this exercise to keep things organized in case I need to manipulate the data
bxplot2 <- ggplot(data = plot2) +
geom_boxplot(aes(x = as.character(avg_rating), y = hhincome),
color = 'black', fill ='white')+
facet_wrap(~county) + # break apart by county
labs(title = 'average rating and household income by county',
x = 'average rating',
y = 'household income')
plotly::ggplotly(bxplot2)
Splitting the box plots by county shows where the outliers are. Cobb and Fulton Counties have wide income ranges at every rating, while Clayton’s is extremely compressed. I think Gwinnett and DeKalb Counties show a moderate spread in coffee shop ratings and household income.
Plot 3
plot.df <- coffee %>%
select(c(county, pct_white, hhincome, review_count_log))
plot3 <- ggplot(data = plot.df,
aes(x = review_count_log, y = hhincome)) +
geom_point(mapping = aes(color = pct_white),
size = 4.2,
alpha = .45) +
facet_wrap( ~ county) +
scale_color_gradient(low = 'purple', high = 'red') +
labs(x = 'review count (log)',
y = 'hhincome',
color = 'pct_white',
title = 'Scatterplot: Review Count vs. Household Income')
print(plot3)
Plot 3 shows logged review count against household income, with points colored by percent white. From this graph and the previous ones, I’m starting to suspect that there aren’t many coffee shops in Clayton County and that incomes there are generally low. Meanwhile, Fulton County is all over the place. Gwinnett, DeKalb, and Cobb Counties start to show a roughly normal-looking distribution of income and review counts. When we add race as a factor, race and household income appear linked: most high-income points are in majority-white areas and most low-income points are in majority-nonwhite areas. However, there appears to be a healthy racial mix at middle incomes in Fulton, Cobb, and Gwinnett.
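The suspicions above could be checked with a per-county table. The sketch below is a rough check, not a formal analysis: `county_summary` is a hypothetical helper name, and it assumes each row of `coffee` is one coffee shop and that the `county`, `hhincome`, and `pct_white` columns are as used in the plots.

```r
# Hypothetical helper: shop count, median income, and the
# pct_white / hhincome correlation within each county.
county_summary <- function(df) {
  groups <- split(df, df$county)
  data.frame(
    county           = names(groups),
    n_shops          = vapply(groups, nrow, integer(1)),
    median_hhincome  = vapply(groups, function(g) as.numeric(median(g$hhincome, na.rm = TRUE)), numeric(1)),
    # Pearson correlation, dropping rows with a missing value in either column
    cor_white_income = vapply(groups, function(g) cor(g$pct_white, g$hhincome, use = "complete.obs"), numeric(1)),
    row.names        = NULL
  )
}

# Only runs when the coffee data frame from above is loaded.
if (exists("coffee")) print(county_summary(coffee))
```

A small `n_shops` and low `median_hhincome` for Clayton would fit what the plot suggests, and a clearly positive `cor_white_income` would fit the link between race and income.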
Plot 4
plot4.df <- coffee %>%
select(c(X, county, pct_white, review_count_log, pct_pov_log, pop, hhincome))
plot4.df <- plot4.df %>%
pivot_longer(cols = c(pct_white, pct_pov_log, pop, hhincome),
names_to = 'variables',
values_to = 'value')
plot4 <- ggplot(data = plot4.df,
aes(x = review_count_log, y = value, color = county)) +
geom_point(size = .65,
alpha = .65) +
geom_smooth(method = lm) +
facet_wrap( ~ variables, scales ='free_y',
labeller = as_labeller(c(
hhincome = 'median household income ($)',
pct_pov_log = 'percent poverty (log)',
pct_white = 'percent white (%)',
pop = 'total population'
))) +
labs(x = 'Review Count (log)',
y = 'values',
color = 'county',
title = 'Scatterplots of logged review count and neighborhood characteristics')
print(plot4)
## `geom_smooth()` using formula = 'y ~ x'
The four scatterplots and their regression lines show how each neighborhood variable relates to review count, by county. For most counties, median household income is not a good predictor of review count; the exception is DeKalb County, whose line slopes upward. Percent poverty has negative slopes across the board, some counties more modest than others. Percent white, oddly, has a positive relationship with review count, with lines trending upward, while total population seems to have a near-zero relationship with it. Taken together, these graphs suggest that review counts tend to be higher in medium-to-high-income, majority-white neighborhoods. I wasn’t sure how to get the R^2 and p-values into the graphs like in your example.
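One possible way to get R^2 and p-values onto the panels is to fit the same `value ~ review_count_log` model that `geom_smooth(method = lm)` draws, once per variable and county, and then add the results with `geom_text()`. This is a sketch under assumptions, not necessarily how the example was made: `group_fit_stats` is a hypothetical helper name, the label layout is a guess, and every group is assumed to have enough distinct points for a slope to be estimated.

```r
# Hypothetical helper: R^2 and slope p-value for each variable/county pair.
group_fit_stats <- function(long_df) {
  groups <- split(long_df, list(long_df$variables, long_df$county), drop = TRUE)
  rows <- lapply(groups, function(g) {
    # Drop non-finite values, which lm() cannot handle
    g <- g[is.finite(g$value) & is.finite(g$review_count_log), ]
    fit <- summary(lm(value ~ review_count_log, data = g))
    data.frame(
      variables = g$variables[1],
      county    = g$county[1],
      r2        = fit$r.squared,
      p_value   = fit$coefficients["review_count_log", "Pr(>|t|)"]
    )
  })
  out <- do.call(rbind, rows)
  rownames(out) <- NULL
  out
}

# Only runs when the objects from Plot 4 exist in the session.
if (exists("plot4.df") && exists("plot4")) {
  fit_stats <- group_fit_stats(plot4.df)
  fit_stats$label <- sprintf("R2 = %.2f, p = %.3f", fit_stats$r2, fit_stats$p_value)
  # Stagger labels vertically so the counties do not overprint each other
  fit_stats$vjust <- 1.5 * as.integer(factor(fit_stats$county))
  plot4_annotated <- plot4 +
    ggplot2::geom_text(
      data = fit_stats,
      ggplot2::aes(x = -Inf, y = Inf, label = label, vjust = vjust, color = county),
      hjust = -0.05, size = 2.5, inherit.aes = FALSE, show.legend = FALSE
    )
  print(plot4_annotated)
}
```

Because the labels take the same `color = county` mapping as the points, each line of text matches its regression line. The text size and spacing would likely need tuning for the final figure.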