Introduction

Wine point scores have been used in the wine industry for decades.
There are strong opinions about the correlation between the points critics award to wines and the prices of those wines (Vincast, 2016).
But taste is subjective, and critics may disagree with one another.
Are critics’ opinions worth considering when you decide how to spend your money?
This investigation therefore aims to determine whether there is a linear relationship between wine points and market prices.

Problem Statement

“Do wine points have any effect on their prices in the markets?”

To answer this question, we use simple linear regression to examine the relationship between wine prices and their points.

Our analysis steps:

Plotting the relationship between Points (x) and Price (y) to determine whether there is a linear relationship and, if so, identifying the linear trend.
Fitting a linear regression and performing hypothesis tests on the model components.
Testing the assumptions behind linear regression.
Drawing conclusions from the output of the simple linear regression analysis and discussing its strengths and weaknesses.

Data

The chosen dataset is open data sourced from Kaggle.

The dataset contains wine information as below:

Wine origins
Wine review descriptions
Wine prices (in Dollars)
Wine points rated by Wine Enthusiast (based on Quality, Body, and Aroma)
Tasters’ Names
Wine variety
Winery

Two variables are used in the analysis: Price (the dependent variable) and Points (the predictor).

Data Cont.

library(tidyverse)  # provides read_csv(), summarise() and the %>% pipe used below
wine <- read_csv("C:/Users/My Anh/Documents/Intro to Statistics/assignment 3/wine_combined.csv")

Number of observations: 280,901
Number of variables: 11

Note: Data preprocessing tasks done prior to the analysis:

Joining two separate wine review datasets with the rbind() function.
Filtering out unnecessary variables.
Addressing NA values in the Price variable by mean imputation.
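
The three preprocessing steps above can be sketched in base R as follows. This is only an illustration under stated assumptions: the data frames wine_a and wine_b and their columns are small hypothetical stand-ins, not the actual Kaggle files, and the original preprocessing script is not shown in this document.

```r
# Hypothetical stand-ins for the two wine review datasets (not the real data)
wine_a <- data.frame(id = 1:3, points = c(87, 90, 85),
                     price = c(15, NA, 22), note = "x")
wine_b <- data.frame(id = 4:5, points = c(92, 88),
                     price = c(60, NA), note = "y")

# 1. Join the two datasets by stacking their rows
wine_demo <- rbind(wine_a, wine_b)

# 2. Filter out variables that are not needed for the analysis
wine_demo <- wine_demo[, c("id", "points", "price")]

# 3. Mean imputation: replace missing prices with the mean of the observed prices
wine_demo$price[is.na(wine_demo$price)] <- mean(wine_demo$price, na.rm = TRUE)

wine_demo
```

Mean imputation would explain why the Missing count for Price is 0 in the summary tables. It leaves the mean unchanged but tends to shrink the standard deviation, which is worth keeping in mind when reading the results.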

Descriptive Statistics

Summarise Price

wine %>% summarise(Min = min(price, na.rm = TRUE),
                   Q1 = quantile(price, probs = .25, na.rm = TRUE),
                   Median = median(price, na.rm = TRUE),
                   Q3 = quantile(price, probs = .75, na.rm = TRUE),
                   Max = max(price, na.rm = TRUE),
                   Mean = mean(price, na.rm = TRUE),
                   SD = sd(price, na.rm = TRUE),
                   n = n(),
                   Missing = sum(is.na(price))) -> table1
knitr::kable(table1)

Min	Q1	Median	Q3	Max	Mean	SD	n	Missing
4	17	27	40	3300	34.16285	37.01939	280901	0

Summarise Points

wine %>% summarise(Min = min(points, na.rm = TRUE),
                   Q1 = quantile(points, probs = .25, na.rm = TRUE),
                   Median = median(points, na.rm = TRUE),
                   Q3 = quantile(points, probs = .75, na.rm = TRUE),
                   Max = max(points, na.rm = TRUE),
                   Mean = mean(points, na.rm = TRUE),
                   SD = sd(points, na.rm = TRUE),
                   n = n(),
                   Missing = sum(is.na(points))) -> table2
knitr::kable(table2)

Min	Q1	Median	Q3	Max	Mean	SD	n	Missing
80	86	88	90	100	88.14693	3.151528	280901	0

Data Visualisation

No action was taken on the Price outliers, as they make up less than 5% of the observations.
Plot the relationship between y (Price) and x (Points) on a scatter plot.

plot(price ~ points, data = wine)
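
One way the share of Price outliers could be checked is with the common 1.5 × IQR rule. The document does not say which outlier rule was used, so the rule is an assumption here, and the prices below are simulated rather than taken from the wine data.

```r
# Simulated right-skewed prices (assumption: NOT the real wine data)
set.seed(1)
price <- rlnorm(1000, meanlog = 3.3, sdlog = 0.6)

# 1.5 x IQR fences around the quartiles
q1 <- unname(quantile(price, 0.25))
q3 <- unname(quantile(price, 0.75))
fence <- 1.5 * IQR(price)

# Flag values outside the fences and report their share in percent
is_outlier <- price < q1 - fence | price > q3 + fence
outlier_share <- mean(is_outlier) * 100
outlier_share
```

With the real data, the same lines would be run on wine$price instead of the simulated vector.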

Data Visualisation Cont. - Data transformation

While Points looks approximately normally distributed, Price is clearly right-skewed.

Solution: apply the log() function to Price.

The LEFT and RIGHT histograms show each variable BEFORE and AFTER the transformation. Points is not transformed, so its histogram appears twice.

par(mfrow=c(2,2))
wine$price %>%  hist(main = "Price")
log(wine$price) %>%  hist (main = "log(Price)")
wine$points %>%  hist (main = "Points")
wine$points %>%  hist (main = "Points")

Data Visualisation Cont.- Linear Plot

plot(log(price) ~ points, data = wine)

There appears to be a positive relationship between log(Price) and Points.

Our linear equation: log(price) = a + b × points + e

Hypothesis Testing - Overall Linear Regression Model

A hypothesis test:

$H_0$: The data does not fit the linear regression model.

$H_A$: The data fits the linear regression model.

price_points <- lm(log(price) ~ points, data = wine)
price_points %>% summary()

## 
## Call:
## lm(formula = log(price) ~ points, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9091 -0.3655 -0.0449  0.3316  4.8221 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.2339124  0.0270672  -267.3   <2e-16 ***
## points       0.1194712  0.0003069   389.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5126 on 280899 degrees of freedom
## Multiple R-squared:  0.3505, Adjusted R-squared:  0.3505 
## F-statistic: 1.516e+05 on 1 and 280899 DF,  p-value: < 2.2e-16

F(1, 280899) = 1.516e+05, p < .001, so we reject $H_0$.

There is statistically significant evidence that the data fits the linear regression model.
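
In a regression with a single predictor, the overall F statistic equals the square of the slope's t statistic, so the overall test and the slope test carry the same information. A quick check using the rounded values printed in the summary output:

```r
# Values copied from the summary() output above (rounded as printed)
t_slope    <- 389.3       # t value for points
f_reported <- 1.516e+05   # F-statistic on 1 and 280899 DF

# For simple linear regression, F = t^2
f_from_t <- t_slope^2
f_from_t

# Relative difference; small, and due only to rounding in the printed output
abs(f_from_t - f_reported) / f_reported
```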

The line of best fit: log(Price) = -7.234 + 0.119 × Points
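
Because the model is fitted on log(Price), a prediction has to be back-transformed with exp() to get dollars. The sketch below uses the rounded coefficients from the fitted line; predict_price() is a hypothetical helper name introduced only for illustration.

```r
# Rounded coefficients from the fitted line above
a <- -7.234
b <- 0.119

# Hypothetical helper: back-transform the predicted log(price) to dollars
predict_price <- function(points) exp(a + b * points)

# Example: estimated price for a wine scoring 90 points
price_90 <- predict_price(90)
round(price_90, 2)
```

With the fitted model object available, exp(predict(price_points, newdata = data.frame(points = 90))) should give a very similar figure using the unrounded coefficients. Note that exponentiating a predicted log value estimates a typical (median-like) price rather than the mean price.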

Hypothesis Testing - Interpretation

Interpreting the intercept:

When Points = 0, log(Price) = -7.234

A score of 0 points cannot occur because, as stated in the data disclaimer, only wines that score at least 80 are taken into account. Therefore, the intercept does not have a meaningful interpretation: Points = 0 lies far outside the observed range, and log(Price) = -7.234 would correspond to a price of about $0.0007.

Interpreting the slope:

As Points increases by 1, log(Price) increases on average by 0.119. On the original scale, this corresponds to a price increase of about 12.6% per point (exp(0.119) ≈ 1.126).
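
Since the response is log(Price), the slope acts multiplicatively on Price. A minimal sketch of the back-transformation, using the rounded slope 0.119:

```r
# Rounded slope from the fitted model
b <- 0.119

# Percentage change in price for a 1-point increase
pct_per_point <- (exp(b) - 1) * 100

# Percentage change in price for a 5-point increase (effects compound)
pct_per_5_points <- (exp(5 * b) - 1) * 100

round(c(one_point = pct_per_point, five_points = pct_per_5_points), 1)
```

The 5-point figure is much larger than five times the 1-point figure because the percentage increases compound rather than add.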

Interpreting R2:

R2 = 0.3505

Meaning: 35% of the variability in log(Price) can be explained by a linear relationship with points.

Hypothesis Testing - Linear Regression Model Parameters

price_points %>% confint()

##                  2.5 %     97.5 %
## (Intercept) -7.2869633 -7.1808615
## points       0.1188698  0.1200727

Intercept:

$H_0$: a = 0

$H_A$: a ≠ 0

p < 2e-16 (< 0.01), and the 95% CI [-7.2869633, -7.1808615] does not contain 0, so we reject $H_0$.

Slope:

$H_0$: b = 0

$H_A$: b ≠ 0

p < 2e-16 (< 0.01), and the 95% CI [0.1188698, 0.1200727] does not contain 0, so we reject $H_0$.
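
The interval that confint() reports for the slope can be reproduced by hand as estimate ± t critical value × standard error. The sketch below uses the estimate, standard error and degrees of freedom printed in the summary output, so small differences from rounding are expected.

```r
# Values copied from the summary() output above
est <- 0.1194712   # slope estimate for points
se  <- 0.0003069   # standard error of the slope
df  <- 280899      # residual degrees of freedom

# Two-sided 95% critical value from the t distribution
t_crit <- qt(0.975, df)

# Confidence interval: estimate +/- critical value * standard error
ci_slope <- est + c(-1, 1) * t_crit * se
ci_slope
```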

Testing Assumptions

Independent

Independence was assumed because each price and points pair comes from a unique ID that represents an individual wine.

Linearity

price_points %>% plot(which = 1)

In the plot above, the spread of the residuals appears to remain the same and the red line follows the data, which is consistent with a linear relationship.

Testing Assumptions Cont.

Normality of residuals

price_points %>% plot(which = 2)

The points suggest there are no major deviations from normality. It would be safe to assume the residuals are approximately normally distributed.

Testing Assumptions Cont.

Homoscedasticity

price_points %>% plot(which = 3)

The red line follows the data. The variance in the residuals appears constant across predicted values.

Testing Assumptions Cont.

Influential cases

price_points %>% plot(which = 5)

In the plot above, no values fall close to the Cook's distance bands. In fact, the bands are not even visible.
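
The visual check for influential cases can be complemented with a numeric one based on Cook's distance. The sketch below runs on simulated data, since the wine file is not available here; the simulated coefficients and the 4/n cutoff (a common rule of thumb, not something stated in this analysis) are assumptions.

```r
# Simulated data loosely resembling the fitted model (NOT the real wine data)
set.seed(42)
pts <- sample(80:100, 200, replace = TRUE)
log_price <- -7.2 + 0.12 * pts + rnorm(200, sd = 0.5)

fit <- lm(log_price ~ pts)

# Cook's distance for every observation
cd <- cooks.distance(fit)

# Largest Cook's distance and the number of cases above the 4/n rule of thumb
max(cd)
sum(cd > 4 / length(cd))
```

With the real model, cooks.distance(price_points) would be used in place of the simulated fit.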

Strength and Direction of the Linear Relationship

Correlation Coefficient r:

A hypothesis test for r:

$H_0$: r = 0

$H_A$: r ≠ 0

r <- cor(log(wine$price), wine$points, use = "complete.obs")
r

## [1] 0.592009

library(psychometric)  # provides CIr()
CIr(r, n = 280901, level = .95)

## [1] 0.5896017 0.5944057

The confidence interval does not contain 0 (p < 0.001), therefore we reject $H_0$. There is a statistically significant positive correlation between log(price) and points.
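
The confidence interval for r can also be computed in base R with the Fisher z transformation, which is, as far as we know, the approach CIr() takes. A minimal sketch using the reported r and n:

```r
# Values reported above
r <- 0.592009
n <- 280901

# Fisher z transformation and its standard error
z    <- atanh(r)
se_z <- 1 / sqrt(n - 3)

# 95% interval on the z scale, transformed back to the r scale
ci_r <- tanh(z + c(-1, 1) * qnorm(0.975) * se_z)
ci_r

# Consistency check: in simple linear regression, r squared equals R-squared
r^2
```

The result should be close to the interval printed by CIr(), and r squared should match the R-squared of 0.3505 from the model summary.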

Summary

Simple linear regression summary:

The assumptions are satisfied.
r = 0.592, R2 = 0.3505.
Model ANOVA: F(1, 280899) = 1.516e+05, p < .001.
a = -7.234, p < .001, 95% CI [-7.2869633, -7.1808615]
b = 0.119, p < .001, 95% CI [0.1188698, 0.1200727]

Decision:

Overall model: Reject H0.
Intercept: Reject H0.
Slope: Reject H0.
log(price) = -7.234 + 0.119 × points

Conclusion:

There is a statistically significant positive linear relationship between a wine's log(price) and its points.

Discussion

There is a statistically significant positive linear relationship between wine price and points in the markets. Points are estimated to explain about 35% of the variability in log(price).

log(price) = -7.234 + 0.119 × points

Strengths:

Thorough explanation and model analysis.
Reasonable choice of the two variables (outliers make up less than 5% of the data).

Limitations:

Factors such as regions and brands can also affect wine prices.
Many expensive wines on the market are still unrated.
Wine Enthusiast’s rating scale (80 to 100) does not represent the industry standard rating system.

Proposed directions for future investigations:

How would the results change if regions and brands were taken into account?

Homework for the audience:

Giant Steps 2016 Tarraford Vineyard Chardonnay receives 92 points from Wine Enthusiast for its heavenly taste. Apply the fitted equation to get the estimated price and compare it with the actual price found on Google.

Now raise a toast to your effort!

References

Vincast, 2016, ‘Are Wine Ratings Good for Wine Investment?’, Forbes, 14 Sep, viewed October 16th, 2018 https://www.forbes.com/sites/auctionforecast/2016/09/14/are-wine-ratings-good-for-wine-investment/#333f523b3c8a

Zackthoutt, 2017, ‘Wine Reviews’, Kaggle, viewed October 16th, 2018 https://www.kaggle.com/zynicide/wine-reviews/home

LINEAR REGRESSION

Do Wine Points Affect Their Prices In the Markets?
