CUNY 605

For this example we will use data from Wine Enthuisast magazine that was procured from kaggle.com.

Libraries

First we load the necessary libraries.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(rvest)

## Loading required package: xml2

library(broom)

## Warning: package 'broom' was built under R version 3.5.3

library(ggplot2)
library(stringr)
library(knitr)

Get the data

Then we import the data from github.

wine_df_raw <- read.csv('https://raw.githubusercontent.com/mehtablocker/cuny_605/master/data_files/winemag_data.csv', stringsAsFactors = F)
wine_df_raw %>% head(1) %>% kable()

X.1	X	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
35062	35061	Portugal	Soft, light, flavored with raspberry and red fruit. There is a gentle tannin character to this open, fresh wine.	BonifÃ¡cio	83	7	Portuguese Table Wine			Roger Voss	@vossroger	Caves Bonifacio 2009 BonifÃ¡cio Red (Portuguese Table Wine)	Portuguese Red	Caves Bonifacio

Build the model

We will create an indicator variable (is_us) for whether each wine is from the US or not, and then fit a multi-variate regression model to predict the wine rating (points) using is_us, price, price², and an interaction term for is_us and price.

wine_df <- wine_df_raw %>% 
  mutate(is_us = ifelse(country=="US", 1, 0)) %>% 
  filter(price<500, !is.na(is_us))   #filter NAs and outlier values
lm_fit <- lm(points ~ is_us + price + I(price^2) + is_us*price, data=wine_df)
summary(lm_fit)

## 
## Call:
## lm(formula = points ~ is_us + price + I(price^2) + is_us * price, 
##     data = wine_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.9368  -1.6195   0.0583   1.7924  14.0544 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  8.568e+01  3.063e-02 2797.045   <2e-16 ***
## is_us       -4.121e-02  4.263e-02   -0.967    0.334    
## price        9.211e-02  9.342e-04   98.602   <2e-16 ***
## I(price^2)  -1.990e-04  3.250e-06  -61.234   <2e-16 ***
## is_us:price -1.583e-03  9.702e-04   -1.631    0.103    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.495 on 36261 degrees of freedom
## Multiple R-squared:  0.3264, Adjusted R-squared:  0.3263 
## F-statistic:  4393 on 4 and 36261 DF,  p-value: < 2.2e-16

The adjusted R-squared of 0.3263477 indicates a poor fit. The intercept shows that “zeroing” everything out would lead to a rating of 85.6837982.

The coefficients for is_us and the interaction term show that being from the US hurts wines at lower price ranges and then hurts even more as price increases. The coefficients for price and price² show that an increase in price helps a wine’s rating up to a point, then it starts to hurt. (Keep in mind we filtered out the outlier prices above 500 dollars in order to better fit the overall model.)

Examine the residuals

We use the broom package to put the model into a data frame, then view the residuals in a variety of ways.

lm_df <- augment(lm_fit)

### residual plot
lm_df %>% 
  ggplot(aes(x = .fitted, y = .resid)) + 
  geom_point() + 
  geom_hline(yintercept = 0)

### residual histogram
lm_df %>% 
  ggplot(aes(.resid)) + 
  geom_histogram(bins=30)

### residual qqplot
lm_df %>% 
  ggplot(aes(sample = .resid)) + 
  stat_qq() + 
  stat_qq_line()

The residual plots show some non-normality and outlier values. Between these as well as the low R-squared value, we determine that this model is not a particularly good fit.

CUNY 605

Week 12 Discussion

mehtablocker

April 15, 2019

Libraries

Get the data

Build the model

Examine the residuals