LINEAR REGRESSION

Do Wine Points Affect Their Prices In the Markets?

Anh Viet My Phan (s3258110) - Oanh Tran Thao Kieu (s3425627)

Last updated: 27 October, 2018

Introduction

Problem Statement

“Do wine points have any effect on their prices in the markets?”

To answer this question, we use simple linear regression to examine the relationship between wine prices and their points.

Our analysis steps:

  1. Plotting the relationship between Points (x) and Price (y) to determine whether there is a linear relationship and, if so, identifying the linear trend.

  2. Fitting linear regression and performing all hypothesis tests of the various model components.

  3. Testing the various assumptions behind linear regression.

  4. Drawing conclusions from the output of the simple linear regression analysis and discussing its strengths and weaknesses.

Data

The chosen dataset is open data sourced from Kaggle.

The dataset contains wine information as below:

Two variables are used for the analysis: Price (the dependent variable) and Points (the predictor).

Data Cont.

wine <- read_csv("C:/Users/My Anh/Documents/Intro to Statistics/assignment 3/wine_combined.csv")

Note: Data preprocessing tasks done prior to the analysis:

Descriptive Statistics

wine %>% summarise(Min = min(price, na.rm = TRUE),
                   Q1 = quantile(price, probs = .25, na.rm = TRUE),
                   Median = median(price, na.rm = TRUE),
                   Q3 = quantile(price, probs = .75, na.rm = TRUE),
                   Max = max(price, na.rm = TRUE),
                   Mean = mean(price, na.rm = TRUE),
                   SD = sd(price, na.rm = TRUE),
                   n = n(),
                   Missing = sum(is.na(price))) -> table1
knitr::kable(table1)

Min  Q1  Median  Q3  Max   Mean      SD        n       Missing
4    17  27      40  3300  34.16285  37.01939  280901  0

wine %>% summarise(Min = min(points, na.rm = TRUE),
                   Q1 = quantile(points, probs = .25, na.rm = TRUE),
                   Median = median(points, na.rm = TRUE),
                   Q3 = quantile(points, probs = .75, na.rm = TRUE),
                   Max = max(points, na.rm = TRUE),
                   Mean = mean(points, na.rm = TRUE),
                   SD = sd(points, na.rm = TRUE),
                   n = n(),
                   Missing = sum(is.na(points))) -> table2
knitr::kable(table2)

Min  Q1  Median  Q3  Max   Mean      SD        n       Missing
80   86  88      90  100   88.14693  3.151528  280901  0

Data Visualisation

plot(price ~ points, data = wine)

Data Visualisation Cont. - Data transformation

Solution: apply the log() function to Price, whose distribution is strongly right-skewed (median 27, maximum 3300).

par(mfrow=c(2,2))
wine$price %>%  hist(main = "Price")
log(wine$price) %>%  hist (main = "log(Price)")
wine$points %>%  hist (main = "Points")
wine$points %>%  hist (main = "Points")
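
The effect of the log transformation can also be checked numerically. The sketch below is only an illustration: it uses simulated right-skewed prices rather than the wine data, and skewness_simple() is a hypothetical helper written for this example, not a function from the original analysis.

```r
# Simulated right-skewed prices (illustrative only, not the wine data)
set.seed(1)
sim_price <- rlnorm(10000, meanlog = 3.3, sdlog = 0.65)

# Hypothetical helper: moment-based sample skewness
skewness_simple <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness_simple(sim_price)       # clearly positive for right-skewed data
skew_log <- skewness_simple(log(sim_price))  # close to 0 after the log transform
c(raw = skew_raw, log = skew_log)
```

A skewness near zero after the transform is what makes log(Price) a more suitable response for a linear model than the raw price.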

Data Visualisation Cont. - Linear Plot

plot(log(price) ~ points, data = wine)

There appears to be a positive relationship between log(Price) and Points.

Our linear equation: log(price) = a + b × points + e
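
Because R's log() is the natural logarithm, the equation above can be rewritten on the original price scale. The block below is a short derivation under that assumption, using the same symbols a, b and e (the error term):

\[
\log(\text{price}) = a + b \times \text{points} + e
\quad\Longleftrightarrow\quad
\text{price} = \exp(a) \times \exp(b \times \text{points}) \times \exp(e)
\]

On the price scale the model is therefore multiplicative: each extra point multiplies the price by a constant factor \(\exp(b)\) rather than adding a fixed amount.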

Hypothesis Testing - Overall Linear Regression Model

A hypothesis test:

\(H_0\): The data do not fit the linear regression model (points explain none of the variability in log(price)).

\(H_A\): The data fit the linear regression model.

price_points <- lm(log(price) ~ points, data = wine)
price_points %>% summary()
## 
## Call:
## lm(formula = log(price) ~ points, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9091 -0.3655 -0.0449  0.3316  4.8221 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.2339124  0.0270672  -267.3   <2e-16 ***
## points       0.1194712  0.0003069   389.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5126 on 280899 degrees of freedom
## Multiple R-squared:  0.3505, Adjusted R-squared:  0.3505 
## F-statistic: 1.516e+05 on 1 and 280899 DF,  p-value: < 2.2e-16

F(1, 280899) = 1.516e+05, p < .001: we reject \(H_0\).

There is statistically significant evidence that the data fits the linear regression model.

The line of best fit: log(Price) = -7.234 + 0.119 × Points
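
The fitted line can be turned into predicted prices by back-transforming with exp(). The following is a minimal sketch that uses the rounded coefficients from the line above; predict_price() is a hypothetical helper introduced for illustration only.

```r
# Coefficients rounded as in the fitted line above
a <- -7.234
b <- 0.119

# Hypothetical helper: back-transformed prediction for a given number of points
predict_price <- function(points) exp(a + b * points)

# Predicted prices at the bottom, middle and top of the observed points range
pred <- predict_price(c(80, 90, 100))
pred
```

Note that exponentiating a prediction of log(Price) gives a typical (geometric-mean) price rather than the arithmetic mean price, so these values should be read as rough guides.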

Hypothesis Testing - Interpretation

Interpreting the intercept:

When Points = 0, log(Price) = -7.234

A score of 0 points is impossible because, as stated in the data disclaimer, only wines that score at least 80 are taken into account. Points = 0 therefore lies far outside the observed range (80 to 100), and the intercept does not have a meaningful interpretation: it would correspond to a price of exp(-7.234) ≈ $0.0007.

Interpreting the slope:

As Points increases by 1, log(Price) increases on average by 0.119. On the price scale this corresponds to an increase of roughly 12.7% per point, since exp(0.1195) ≈ 1.127.
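
The step from a change in log(Price) to a percentage change in price can be made explicit. The derivation below is a sketch that assumes the fitted model log(Price) = a + b × Points with the slope estimate b ≈ 0.1195 from the summary output:

\[
\frac{\widehat{\text{Price}}(\text{Points} + 1)}{\widehat{\text{Price}}(\text{Points})}
= \frac{\exp\bigl(a + b\,(\text{Points} + 1)\bigr)}{\exp\bigl(a + b\,\text{Points}\bigr)}
= \exp(b) \approx \exp(0.1195) \approx 1.127
\]

The ratio does not depend on the starting number of points, so the same percentage increase applies anywhere in the 80 to 100 range.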

Interpreting R2:

R2 = 0.3505

Meaning: 35% of the variability in log(Price) can be explained by a linear relationship with points.

Hypothesis Testing - Linear Regression Model Parameters

price_points %>% confint()
##                  2.5 %     97.5 %
## (Intercept) -7.2869633 -7.1808615
## points       0.1188698  0.1200727

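Intercept:

\(H_0\): a = 0

\(H_A\): a ≠ 0

The 95% CI for the intercept [-7.287, -7.181] does not capture 0 (t = -267.3, p < .001), so we reject \(H_0\).

Slope:

\(H_0\): b = 0

\(H_A\): b ≠ 0

The 95% CI for the slope [0.119, 0.120] does not capture 0 (t = 389.3, p < .001), so we reject \(H_0\).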
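
The t values and confidence intervals reported above can be reproduced by hand from the estimates and standard errors. The sketch below copies those numbers from the summary() output; the variable names are chosen for this example only.

```r
# Estimates and standard errors copied from the summary() output
est <- c(intercept = -7.2339124, points = 0.1194712)
se  <- c(intercept = 0.0270672,  points = 0.0003069)
df  <- 280899  # residual degrees of freedom

t_value <- est / se                    # test statistic for H0: coefficient = 0
p_value <- 2 * pt(-abs(t_value), df)   # two-sided p-value
crit    <- qt(0.975, df)               # critical value for a 95% interval
ci      <- cbind(lower = est - crit * se, upper = est + crit * se)

round(t_value, 1)
ci
```

Small differences from the confint() output in the last decimal places are expected because the standard errors used here are rounded.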

Testing Assumptions

  1. Independence

Independence was assumed because each price and points pair comes from a distinct ID that represents an individual wine.

  2. Linearity
price_points %>% plot(which = 1)

In the residuals vs fitted plot above, the red line follows the data with no obvious curvature, and the variance of the residuals appears to remain the same.

Testing Assumptions Cont.

  3. Normality of residuals
price_points %>% plot(which = 2)

The points in the normal Q-Q plot suggest there are no major deviations from normality. It would be safe to assume the residuals are approximately normally distributed.

Testing Assumptions Cont.

  4. Homoscedasticity
price_points %>% plot(which = 3)

In the scale-location plot above, the red line fits the data. The variance in the residuals appears constant across predicted values.

Testing Assumptions Cont.

  5. Influential cases
price_points %>% plot(which = 5)

In the residuals vs leverage plot above, no values fall close to the Cook's distance bands. In fact, the bands are not even visible.
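
Influence can also be inspected numerically with Cook's distance, which is what the bands in the leverage plot are based on (plot.lm draws them at 0.5 and 1 by default). The sketch below is an illustration on simulated data, not the wine data, and the 4/n cut-off is an assumed rule of thumb rather than part of the original analysis.

```r
# Simulated data (illustrative only, not the wine data)
set.seed(42)
sim <- data.frame(points = sample(80:100, 500, replace = TRUE))
sim$log_price <- -7.2 + 0.12 * sim$points + rnorm(500, sd = 0.5)

fit <- lm(log_price ~ points, data = sim)

# Cook's distance for every observation
cd <- cooks.distance(fit)

# Rule-of-thumb cut-off of 4/n (an assumed convention)
flagged <- which(cd > 4 / nrow(sim))

max(cd)          # values far below 0.5 mean no point reaches the plotted bands
length(flagged)  # number of observations worth a closer look
```

With the real model, the same calls would be `cooks.distance(price_points)` and a cut-off based on `nrow(wine)`.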

Strength and Direction of the Linear Relationship

Correlation Coefficient r:

\(H_0\): r = 0

\(H_A\): r ≠ 0

r <- cor(log(wine$price), wine$points, use = "complete.obs")
r
## [1] 0.592009
CIr(r, n = 280901, level = .95)
## [1] 0.5896017 0.5944057

The 95% confidence interval [0.590, 0.594] does not capture the \(H_0\) value of 0 (p < .001), so we reject \(H_0\). There is a statistically significant positive correlation between log(price) and points.
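
The interval above can be approximated by hand with the Fisher z-transformation, and squaring r links it back to the R2 reported earlier. This is a minimal sketch using base R only; whether CIr() performs exactly this computation internally is an assumption.

```r
r <- 0.592009   # correlation reported above
n <- 280901     # sample size

z  <- atanh(r)          # Fisher z-transformation of r
se <- 1 / sqrt(n - 3)   # approximate standard error of z

# 95% interval on the z scale, back-transformed to the r scale
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)
ci

# Squared correlation, comparable to Multiple R-squared from summary()
r_squared <- r^2
r_squared
```

In simple linear regression the squared correlation equals R2, which is why r ≈ 0.592 and R2 ≈ 0.3505 tell the same story.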

Summary

Simple linear regression summary:

Decision:

Conclusion:

Discussion

There is a statistically significant positive linear relationship between the log of wine price and quality (points) in the markets. Points are estimated to explain about 35% of the variability in log(price).

log(price)= -7.234 +0.119 × points

Strengths:

Limitations:

Propose directions for future investigations:

Homework for audiences:

Now give a toast to your effort!

References

Vincast, 2016, ‘Are Wine Ratings Good for Wine Investment?’, Forbes, 14 Sep, viewed October 16th, 2018 https://www.forbes.com/sites/auctionforecast/2016/09/14/are-wine-ratings-good-for-wine-investment/#333f523b3c8a

Zackthoutt, 2017, ‘Wine Reviews’, Kaggle, viewed October 16th, 2018 https://www.kaggle.com/zynicide/wine-reviews/home