The data was downloaded from Kaggle[^1] and consists of 129,971 records, gathered from Wine Enthusiast Magazine. Although there are a few qualitative variables, we will be looking at the score the wine was given as a function of price.
In other words, is the price you pay a good indicator of quality?
# Read in the data
wine <- read_csv("winemag.csv")
# Fix column names
colnames(wine)[1] <- "ID"
Let’s have a look at the data:
summary(wine$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 4.00 17.00 25.00 35.36 42.00 3300.00 8996
summary(wine$points)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 80.00 86.00 88.00 88.45 91.00 100.00
First, we have some NA values in the price
field that we will have to deal with.
On the other hand points
seems pretty solid. According to the data source, the magazine only prints reviews for wines with a score of 80 or higher, which explains the minimum we see.
First, our scale is a little large on the \(x\) axis due to a few expensive outliers. Personally, I am never paying that much for a bottle of wine, so it may make sense to limit our dataset to wines with a more reasonable cost.
# Limit to wines under $200
wine2 <- wine %>% filter(price < 200 & !is.na(price))
# Plot data
wine2 %>% ggplot(aes(x=price, y=points)) +
geom_point(position="jitter") +
labs(title="Wine Quality vs. Price",
subtitle="(Priced under $200)",
x="Price ($)", y="Points (0-100)")
Looking at the remaining wines, a linear relationship seems less likely. So, we may want to log transform the x-axis.
# Limit to wines under $200
wine2$logPrice <- log(wine2$price)
# Plot data
wine2 %>% ggplot(aes(x=logPrice, y=points)) +
geom_point(position="jitter") +
labs(title="Wine Quality vs. Price",
subtitle="(Priced under $200)",
x="ln(Price) ($)", y="Points (0-100)")
We can see the linear relationship more clearly now. However, it is obvious even from this basic plot that there is a large variation at any particular price point.
Perhaps we should look at a specific country?
# Limit to Italian Wines
wine3 <- wine2 %>% filter(country == "Italy")
# Plot data
wine3 %>% ggplot(aes(x=logPrice, y=points)) +
geom_point(position="jitter") +
labs(title="Wine Quality vs. Price",
subtitle="(Italian Wines Priced under $200)",
x="ln(Price) ($)", y="Points (0-100)",
color="Country")
Limiting to a single country has helped somewhat (if you’re looking for Italian wines). Let’s see how a linear regression would work with this subset.
# Generate the model
m <- lm(wine3$points ~ wine3$logPrice)
# Augment the output
model <- augment(m)
# Re-Plot with model
model %>% ggplot(aes(x=wine3.logPrice, y=wine3.points)) +
geom_point(position="jitter") +
labs(title="Wine Quality vs. Price",
subtitle="(Italian Wines Priced under $200)",
x="ln(Price) ($)", y="Points (0-100)",
color="Country") +
geom_line(aes(x=wine3.logPrice, .fitted), linetype=1,
size=1, col="darkgreen")
Let’s see how it worked:
summary(m)
##
## Call:
## lm(formula = wine3$points ~ wine3$logPrice)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0641 -1.2089 0.0179 1.2699 7.7191
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 79.56718 0.08297 959.0 <2e-16 ***
## wine3$logPrice 2.64387 0.02393 110.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.99 on 16776 degrees of freedom
## Multiple R-squared: 0.4213, Adjusted R-squared: 0.4212
## F-statistic: 1.221e+04 on 1 and 16776 DF, p-value: < 2.2e-16
Our model parameters seem to be significant, and our r-squared is a decent 0.42. Let’s look at residuals next:
# Residual Plot
model %>%
ggplot(aes(x=.fitted, y=.resid)) +
geom_point() +
labs(title="Residual Plot",
x="Fitted Value", y="Residual") +
geom_hline(yintercept = 0, linetype = 1, col = "darkgreen")
Our residuals show a mostly uniform variance, except near the lower end of the fitted values where is seems to be less variable.
# QQ Plot
model %>%
ggplot(aes(sample=.resid)) +
stat_qq() + stat_qq_line(col="darkgreen", linetype = 1) +
labs(title = "QQ Plot",
x = "Theoretical",
y = "Sample")
Here we see the residuals are relatively normally-distributed except near the left end of the graph.
Although the regression was somewhat successful, the usefulness of the model is low. Although it does show (at least for Italian wines) that price does somewhat correlate with quality, the large variance of the residuals shows that two wines the same price point can differ significantly.