Hint: You can choose any data you like but can’t take one that is already taken by other groups.
library(tidyverse)
wine_ratings <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-05-28/winemag-data-130k-v2.csv") %>%
  # keep the 7 most common varieties and lump the rest into "Other"
  mutate(variety = fct_lump(variety, 7)) %>%
  # drop the lumped "Other" level
  filter(variety != "Other")
# Explain the boxplots. What is the top rated variety? How about the outliers? Which variety seems to be more consistent in their ratings than others?
wine_ratings %>%
  mutate(variety = fct_reorder(variety, points, na.rm = TRUE)) %>%
  ggplot(aes(variety, points, fill = variety)) +
  geom_boxplot() +
  coord_flip() +
  theme(legend.position = "none") +
  labs(title = "Wine ratings by variety",
       subtitle = "Seven most common varieties",
       x = NULL,
       y = NULL)
# The intercept is the predicted rating for the baseline variety, Bordeaux-style Red Blend, when log2(price) = 0, i.e. at a price of $1 (not a free wine).
# A $1 Bordeaux-style Red Blend is expected to score about 78 points on a 0-100 scale.
# Points increase by about 2 every time the price doubles.
# Some varieties (e.g., Cabernet Sauvignon) have a negative effect on points, while others (e.g., Riesling) have a positive effect. Which variety would you produce if you ran a winery and had a choice?
wine_ratings %>%
  lm(points ~ log2(price) + variety, data = .) %>%
  summary()
##
## Call:
## lm(formula = points ~ log2(price) + variety, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.8406 -1.5161 0.1576 1.7000 9.2261
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 78.15191 0.06455 1210.695 < 2e-16 ***
## log2(price) 2.11226 0.01104 191.377 < 2e-16 ***
## varietyCabernet Sauvignon -0.45211 0.04111 -10.999 < 2e-16 ***
## varietyChardonnay 0.09023 0.04004 2.254 0.0242 *
## varietyPinot Noir 0.03115 0.03917 0.795 0.4265
## varietyRed Blend -0.03726 0.04192 -0.889 0.3741
## varietyRiesling 1.50753 0.04745 31.773 < 2e-16 ***
## varietySauvignon Blanc 0.37196 0.04859 7.655 1.96e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.397 on 56816 degrees of freedom
## (3690 observations deleted due to missingness)
## Multiple R-squared: 0.4136, Adjusted R-squared: 0.4136
## F-statistic: 5726 on 7 and 56816 DF, p-value: < 2.2e-16
The data contains close to 130,000 wine reviews, one review per row. The points are wine ratings on a 100-point scale. The data covers more than 700 varieties of wine from more than 70 countries. Other variables include the winery where the wine was produced, the taster's name, and the price of the wine.
The box plots are in descending order of median rating, which means Riesling is the top choice, just ahead of Pinot Noir. The interquartile range (the width of each box) shows how consistent the ratings are for the middle half of the wines. Red Blend, with a thinner box, has ratings that are more consistent than the others.
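To make the "thinner box means more consistent ratings" reading concrete, here is a minimal sketch that computes the interquartile range per group. The data frame `toy` and its two varieties are made up for illustration and do not come from the wine dataset.

```r
# Hypothetical ratings for two varieties (made-up numbers, not from the dataset)
toy <- data.frame(
  variety = rep(c("A", "B"), each = 5),
  points  = c(86, 87, 88, 89, 90,   # variety A: tightly clustered ratings
              80, 84, 88, 92, 96)   # variety B: widely spread ratings
)

# Interquartile range (Q3 - Q1) of points within each variety;
# this is the width of the box that geom_boxplot() draws
iqr_by_variety <- tapply(toy$points, toy$variety, IQR)
iqr_by_variety
```

The variety with the smaller value would be drawn with the thinner box. The same `tapply()` call could be applied to `wine_ratings$points` and `wine_ratings$variety` to put numbers on the boxes in the plot.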
The intercept is the predicted rating for the baseline variety, Bordeaux-style Red Blend, when log2(price) is 0, that is, for a bottle priced at $1 (a free wine would require log2(0), which is undefined). In simpler terms, a $1 Bordeaux-style Red Blend is expected to score about 78 points on a 0-100 scale. The coefficient on log2(price) is about 2, which means the rating increases by about 2 points every time the price doubles. Relative to the baseline, some varieties have a negative effect on points (Cabernet Sauvignon), while others have a positive effect (Riesling). If we ran a winery and could choose what to produce, we would choose Riesling because it is the wine reviewers seem to rate most highly.
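The interpretation of the coefficients can be checked by hand. The sketch below plugs the estimates from the regression output into the model equation; `predict_points()` is a hypothetical helper written for this illustration, not a function from any package.

```r
# Coefficients copied from the regression output above
b_intercept <- 78.15191   # baseline variety (Bordeaux-style Red Blend) at log2(price) = 0
b_log2price <- 2.11226    # change in points per doubling of price
b_riesling  <- 1.50753    # Riesling offset relative to the baseline variety

# Hypothetical helper: predicted points for a given price and variety offset
predict_points <- function(price, variety_offset = 0) {
  b_intercept + b_log2price * log2(price) + variety_offset
}

base_16 <- predict_points(16)              # baseline variety at a price of 16
base_32 <- predict_points(32)              # same variety at double the price
ries_16 <- predict_points(16, b_riesling)  # Riesling at a price of 16

base_32 - base_16   # difference equals the log2(price) coefficient
ries_16 - base_16   # difference equals the Riesling coefficient
predict_points(1)   # log2(1) = 0, so this returns the intercept
```

Doubling the price from 16 to 32 moves the prediction by exactly the log2(price) coefficient, and switching variety at a fixed price moves it by the variety coefficient.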
Hint: Create at least two plots.
library(ggplot2)
ggplot(data = wine_ratings,
       mapping = aes(x = points, y = price)) +
  geom_point()
ggplot(wine_ratings, aes(x = points)) +
  geom_bar()
# wine_ratings was created above, so no data() call is needed
# select numeric variables
df <- dplyr::select_if(wine_ratings, is.numeric)
# calculate the correlations
r <- cor(df, use = "complete.obs")
round(r, 2)
## X1 points price
## X1 1.00 0.01 0.0
## points 0.01 1.00 0.4
## price 0.00 0.40 1.0
# simple linear regression of rating on price
wine_lm <- lm(points ~ price,
              data = wine_ratings)
summary(wine_lm)
##
## Call:
## lm(formula = points ~ price, data = wine_ratings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -85.684 -1.964 -0.042 2.046 10.202
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.765e+01 1.556e-02 5634.6 <2e-16 ***
## price 2.607e-02 2.491e-04 104.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.866 on 56822 degrees of freedom
## (3690 observations deleted due to missingness)
## Multiple R-squared: 0.1616, Adjusted R-squared: 0.1616
## F-statistic: 1.096e+04 on 1 and 56822 DF, p-value: < 2.2e-16
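The correlation table and this regression agree with each other: in a regression with a single predictor, R-squared is the square of the correlation between the two variables. The sketch below checks this with the numbers reported above and then confirms the identity on a small data frame, `toy`, whose values are made up for illustration.

```r
# Values copied from the outputs above
r_points_price <- 0.40     # correlation between points and price (rounded)
r_squared_lm   <- 0.1616   # Multiple R-squared of lm(points ~ price)

r_points_price^2     # square of the rounded correlation
sqrt(r_squared_lm)   # implied correlation, before rounding

# Made-up data to check the identity exactly on a small example
toy <- data.frame(price  = c(10, 15, 20, 30, 50, 80),
                  points = c(84, 86, 87, 88, 91, 92))
toy_fit <- lm(points ~ price, data = toy)
toy_r2  <- summary(toy_fit)$r.squared
toy_cor <- cor(toy$points, toy$price)
all.equal(toy_r2, toy_cor^2)
```

Price alone therefore explains about 16% of the variation in points, compared with about 41% for the earlier model that uses log2(price) and variety.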