For this example we will use data from Wine Enthuisast magazine that was procured from kaggle.com.


Libraries

First we load the necessary libraries.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(rvest)
## Loading required package: xml2
library(broom)
## Warning: package 'broom' was built under R version 3.5.3
library(ggplot2)
library(stringr)
library(knitr)


Get the data

Then we import the data from github.

wine_df_raw <- read.csv('https://raw.githubusercontent.com/mehtablocker/cuny_605/master/data_files/winemag_data.csv', stringsAsFactors = F)
wine_df_raw %>% head(1) %>% kable()
X.1 X country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
35062 35061 Portugal Soft, light, flavored with raspberry and red fruit. There is a gentle tannin character to this open, fresh wine. Bonifácio 83 7 Portuguese Table Wine Roger Voss @vossroger Caves Bonifacio 2009 Bonifácio Red (Portuguese Table Wine) Portuguese Red Caves Bonifacio


Build the model

We will create an indicator variable (is_us) for whether each wine is from the US or not, and then fit a multi-variate regression model to predict the wine rating (points) using is_us, price, price2, and an interaction term for is_us and price.

wine_df <- wine_df_raw %>% 
  mutate(is_us = ifelse(country=="US", 1, 0)) %>% 
  filter(price<500, !is.na(is_us))   #filter NAs and outlier values
lm_fit <- lm(points ~ is_us + price + I(price^2) + is_us*price, data=wine_df)
summary(lm_fit)
## 
## Call:
## lm(formula = points ~ is_us + price + I(price^2) + is_us * price, 
##     data = wine_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.9368  -1.6195   0.0583   1.7924  14.0544 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  8.568e+01  3.063e-02 2797.045   <2e-16 ***
## is_us       -4.121e-02  4.263e-02   -0.967    0.334    
## price        9.211e-02  9.342e-04   98.602   <2e-16 ***
## I(price^2)  -1.990e-04  3.250e-06  -61.234   <2e-16 ***
## is_us:price -1.583e-03  9.702e-04   -1.631    0.103    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.495 on 36261 degrees of freedom
## Multiple R-squared:  0.3264, Adjusted R-squared:  0.3263 
## F-statistic:  4393 on 4 and 36261 DF,  p-value: < 2.2e-16

The adjusted R-squared of 0.3263477 indicates a poor fit. The intercept shows that “zeroing” everything out would lead to a rating of 85.6837982.

The coefficients for is_us and the interaction term show that being from the US hurts wines at lower price ranges and then hurts even more as price increases. The coefficients for price and price2 show that an increase in price helps a wine’s rating up to a point, then it starts to hurt. (Keep in mind we filtered out the outlier prices above 500 dollars in order to better fit the overall model.)


Examine the residuals

We use the broom package to put the model into a data frame, then view the residuals in a variety of ways.

lm_df <- augment(lm_fit)

### residual plot
lm_df %>% 
  ggplot(aes(x = .fitted, y = .resid)) + 
  geom_point() + 
  geom_hline(yintercept = 0)

### residual histogram
lm_df %>% 
  ggplot(aes(.resid)) + 
  geom_histogram(bins=30)

### residual qqplot
lm_df %>% 
  ggplot(aes(sample = .resid)) + 
  stat_qq() + 
  stat_qq_line()

The residual plots show some non-normality and outlier values. Between these as well as the low R-squared value, we determine that this model is not a particularly good fit.