Anh Viet My Phan (s3258110) - Oanh Tran Thao Kieu (s3425627)
Last updated: 27 October, 2018
Wine points have existed in the wine industry for decades.
There are strong opinions about the correlation between the points critics give to wines and the wines' prices (Vincast, 2016).
However, taste is subjective, and critics may disagree with one another.
Are critics’ opinions worth considering when you decide how to spend your money?
This investigation therefore aims to determine whether there is a linear relationship between wine points and wine prices in the market.
“Do wine points have any effect on wine prices in the market?”
To answer these questions, we use simple linear regression to examine the relationship between wine prices and wine points.
Our analysis steps:
Plotting the relationship between Points (x) and Price (y) to determine whether there is a linear relationship and, if so, identifying the linear trend.
Fitting linear regression and performing all hypothesis tests of the various model components.
Testing the various assumptions behind linear regression.
Drawing conclusions from the output of the simple linear regression analysis and discussing its strengths and weaknesses.
The chosen dataset is open data sourced from Kaggle.
The dataset contains wine information as below:
Wine origins
Wine review descriptions
Wine prices (in Dollars)
Wine points rated by Wine Enthusiast (based on Quality, Body, and Aroma)
Tasters’ Names
Wine variety
Winery
Two variables are used for the analysis: Price (the dependent variable) and Points (the predictor).
library(tidyverse)  # assumed: supplies read_csv(), %>% and summarise()
wine <- read_csv("C:/Users/My Anh/Documents/Intro to Statistics/assignment 3/wine_combined.csv")
Number of observations: 280,901
Number of variables: 11
Joining two separate wine review datasets with the rbind() function.
Filtering out unnecessary variables.
Addressing NA values in the Price variable by mean imputation.
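The three preprocessing steps above can be sketched on toy data. The two small data frames and their column names are hypothetical stand-ins, not the actual Kaggle files, so this only illustrates the technique.

```r
# Hypothetical stand-ins for the two raw review files (not the real data).
reviews_a <- data.frame(country = c("Italy", "US"),
                        points  = c(87, 90),
                        price   = c(20, NA),
                        extra   = c("x", "y"))
reviews_b <- data.frame(country = c("France", "Spain"),
                        points  = c(92, 85),
                        price   = c(55, 15),
                        extra   = c("z", "w"))

# 1. Stack the two data sets row-wise (columns are matched by name).
wine_toy <- rbind(reviews_a, reviews_b)

# 2. Keep only the variables needed for the analysis.
wine_toy <- wine_toy[, c("country", "points", "price")]

# 3. Mean imputation: replace missing prices with the mean of the observed prices.
price_mean <- mean(wine_toy$price, na.rm = TRUE)
wine_toy$price[is.na(wine_toy$price)] <- price_mean
```

Mean imputation leaves the sample mean unchanged but slightly shrinks the standard deviation, which is worth keeping in mind when reading the summary tables.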
wine %>% summarise(Min = min(price,na.rm = TRUE),
Q1 = quantile(price,probs = .25,na.rm = TRUE),
Median = median(price, na.rm = TRUE),
Q3 = quantile(price,probs = .75,na.rm = TRUE),
Max = max(price,na.rm = TRUE),
Mean = mean(price, na.rm = TRUE),
SD = sd(price, na.rm = TRUE),
n = n(),
Missing = sum(is.na(price))) -> table1
knitr::kable(table1)
| Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|
| 4 | 17 | 27 | 40 | 3300 | 34.16285 | 37.01939 | 280901 | 0 |
wine %>% summarise(Min = min(points,na.rm = TRUE),
Q1 = quantile(points,probs = .25,na.rm = TRUE),
Median = median(points, na.rm = TRUE),
Q3 = quantile(points,probs = .75,na.rm = TRUE),
Max = max(points,na.rm = TRUE),
Mean = mean(points, na.rm = TRUE),
SD = sd(points, na.rm = TRUE),
n = n(),
Missing = sum(is.na(points))) -> table2
knitr::kable(table2)
| Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|
| 80 | 86 | 88 | 90 | 100 | 88.14693 | 3.151528 | 280901 | 0 |
No action was needed for the outliers in Price, as they account for less than 5% of the observations.
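One common way to arrive at an outlier share like this is the 1.5 × IQR boxplot rule. Whether that rule was used here is an assumption; the sketch below runs on simulated right-skewed prices (`price_sim`, a hypothetical name), not on the wine data.

```r
# Simulated right-skewed prices (log-normal); NOT the wine data.
set.seed(1)
price_sim <- exp(rnorm(10000, mean = 3.3, sd = 0.6))

# Boxplot fences: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR.
q <- unname(quantile(price_sim, probs = c(.25, .75)))
iqr <- q[2] - q[1]
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Share of observations falling outside the fences.
is_outlier <- price_sim < lower_fence | price_sim > upper_fence
outlier_share <- mean(is_outlier)
outlier_share
```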
Plot the relationship between y (Price) and x (Points) on a scatter plot.
plot(price ~ points, data = wine)
Price is heavily right-skewed (median 27, maximum 3300), so the solution is to transform it with the log() function.
par(mfrow=c(2,2))
wine$price %>% hist(main = "Price")
log(wine$price) %>% hist (main = "log(Price)")
wine$points %>% hist (main = "Points")
wine$points %>% hist (main = "Points")
plot(log(price) ~ points, data = wine)
There is likely to be a positive relationship between Price and Points.
Our linear equation: log(price) = a + b × points + e
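To see what fitting this equation involves, here is a minimal sketch on simulated data. The "true" values `a_true` and `b_true` and the noise level are made-up assumptions chosen to resemble the scale of this analysis; the point is only that `lm()` on `log(price)` recovers them.

```r
# Simulate data from log(price) = a + b * points + e (all values hypothetical).
set.seed(42)
n <- 5000
a_true <- -7.2
b_true <- 0.12
points_sim <- sample(80:100, n, replace = TRUE)
log_price_sim <- a_true + b_true * points_sim + rnorm(n, sd = 0.5)
sim <- data.frame(points = points_sim, price = exp(log_price_sim))

# Fit the same kind of model as in the analysis: log(price) on points.
fit_sim <- lm(log(price) ~ points, data = sim)
coef(fit_sim)  # estimates of a and b
```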
A hypothesis test:
\(H_0\): The data does not fit the linear regression model.
\(H_A\): The data fits the linear regression model.
price_points <- lm(log(price) ~ points, data = wine)
price_points %>% summary()
##
## Call:
## lm(formula = log(price) ~ points, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9091 -0.3655 -0.0449 0.3316 4.8221
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.2339124 0.0270672 -267.3 <2e-16 ***
## points 0.1194712 0.0003069 389.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5126 on 280899 degrees of freedom
## Multiple R-squared: 0.3505, Adjusted R-squared: 0.3505
## F-statistic: 1.516e+05 on 1 and 280899 DF, p-value: < 2.2e-16
F(1, 280899) = 1.516e+05, p < .001: we reject \(H_0\).
There is statistically significant evidence that the data fit the linear regression model.
The line of best fit: log(Price) = -7.234 + 0.119 × Points
Interpreting the intercept:
When Points = 0, log(Price) = -7.234
A score of 0 points is impossible here: as stated in the data disclaimer, only wines that score at least 80 are taken into account. The intercept is therefore an extrapolation far outside the observed range of 80 to 100 points and has no meaningful interpretation (it would correspond to a price of about $0.0007).
Interpreting the slope:
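As Points increases by 1, log(Price) increases by 0.119 on average. On the original scale, this means the price is multiplied by exp(0.1195) ≈ 1.127, i.e. it rises by about 12.7% per additional point.
Interpreting R²:
R² = 0.3505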
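Because the model is fitted on the log scale, fitted values have to be back-transformed with `exp()` to be read in dollars. A minimal sketch using the coefficients from the regression output; the helper name `predict_price` is hypothetical, and the back-transformed value is better read as a typical (geometric-mean) price than as an arithmetic mean.

```r
# Coefficients copied from the summary() output.
a <- -7.2339124
b <- 0.1194712

# Hypothetical helper: back-transform the fitted log(price) to dollars.
predict_price <- function(points) exp(a + b * points)

# Typical prices implied by the model at 85, 90 and 95 points.
predict_price(c(85, 90, 95))

# One extra point multiplies the predicted price by exp(b), whatever the starting score.
ratio <- predict_price(91) / predict_price(90)
ratio
```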
Meaning: 35% of the variability in log(Price) can be explained by a linear relationship with points.
price_points %>% confint()
## 2.5 % 97.5 %
## (Intercept) -7.2869633 -7.1808615
## points 0.1188698 0.1200727
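The slope interval can be checked by hand as estimate ± critical t value × standard error. This sketch plugs in the rounded numbers printed in the regression output, so tiny differences in the last digits are possible.

```r
# Slope estimate, standard error and residual degrees of freedom from summary().
est <- 0.1194712
se  <- 0.0003069
df  <- 280899

# Two-sided 95% critical value of the t distribution.
t_crit <- qt(0.975, df)

# Confidence interval: estimate +/- t_crit * SE.
ci_slope <- est + c(-1, 1) * t_crit * se
ci_slope
```

With this many degrees of freedom the critical value is essentially the normal 1.96, which is why the interval is so narrow.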
Intercept:
\(H_0\): a = 0
\(H_A\): a ≠ 0
Slope:
\(H_0\): b = 0
\(H_A\): b ≠ 0
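The slope test can be reproduced from the regression output: the t statistic is the estimate divided by its standard error, and in a simple linear regression its square equals the model F statistic. A short sketch using the rounded values from the output:

```r
# Slope estimate, standard error and residual degrees of freedom from summary().
est <- 0.1194712
se  <- 0.0003069
df  <- 280899

# t statistic for H0: b = 0, and its two-sided p-value.
t_slope <- est / se
p_slope <- 2 * pt(abs(t_slope), df, lower.tail = FALSE)

# With a single predictor, F = t^2 (up to rounding of the inputs).
f_from_t <- t_slope^2

c(t = t_slope, p = p_slope, F = f_from_t)
```

The p-value is so small that it underflows to 0 in double precision, which is why R prints it as `<2e-16`.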
Independence was assumed because each wine's price and points come from a separate review with a unique ID representing an individual wine.
price_points %>% plot(which = 1)
In the plot above, the variance of the residuals appears to remain constant, and the red line follows the data.
price_points %>% plot(which = 2)
The points suggest there are no major deviations from normality. It would be safe to assume the residuals are approximately normally distributed.
price_points %>% plot(which = 3)
The red line follows the data, and the variance of the residuals appears constant across predicted values.
price_points %>% plot(which = 5)
In the plot above, no values fall close to the Cook's distance bands. In fact, the bands are not even visible.
Correlation Coefficient r:
\(H_0\): r = 0
\(H_A\): r ≠ 0
r <- cor(log(wine$price), wine$points, use = "complete.obs")
r
## [1] 0.592009
library(psychometric)  # provides CIr()
CIr(r, n = 280901, level = .95)
## [1] 0.5896017 0.5944057
The 95% confidence interval [0.590, 0.594] does not contain the null value of 0 (p < 0.001); therefore, we reject \(H_0\). There is a statistically significant positive correlation between log(price) and points.
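The interval for r can be reproduced in base R with the Fisher z-transformation. That `CIr()` uses this method internally is an assumption here, although the numbers below agree with its output.

```r
# Sample correlation and sample size from the analysis.
r <- 0.592009
n <- 280901

# Fisher z-transformation: atanh(r) is approximately normal with SE 1 / sqrt(n - 3).
z    <- atanh(r)
se_z <- 1 / sqrt(n - 3)

# 95% interval on the z scale, transformed back to the r scale with tanh().
z_crit <- qnorm(0.975)
ci_r <- tanh(z + c(-1, 1) * z_crit * se_z)
ci_r

# In simple linear regression, R-squared is the square of r.
r_squared <- r^2
r_squared
```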
Simple linear regression summary:
The assumptions were satisfied.
r = 0.592, R² = 0.3505.
Model ANOVA: F(1, 280899) = 1.516e+05, p < .001.
a = -7.234, p < .001, 95% CI [-7.287, -7.181]
b = 0.119, p < .001, 95% CI [0.1189, 0.1201]
Decision:
Overall model: Reject H0.
Intercept: Reject H0.
Slope: Reject H0.
log(price)= -7.234 +0.119 × points
Conclusion:
There is a statistically significant positive linear relationship between log wine price and points in the market. Points are estimated to explain about 35% of the variability in log(price).
log(price)= -7.234 +0.119 × points
Strengths:
Thorough explanation and model analysis.
Reasonable choice of the two variables (outliers account for less than 5%).
Limitations:
Factors such as regions and brands can also affect wine prices.
Many expensive wines on the market are still unrated.
Wine Enthusiast’s rating scale (80 to 100) does not represent the industry-standard rating system.
Propose directions for future investigations:
Homework for audiences:
Now give a toast to your effort!
Vincast, 2016, ‘Are Wine Ratings Good for Wine Investment?’, Forbes, 14 Sep, viewed October 16th, 2018 https://www.forbes.com/sites/auctionforecast/2016/09/14/are-wine-ratings-good-for-wine-investment/#333f523b3c8a
Zackthoutt, 2017, ‘Wine Reviews’, Kaggle, viewed October 16th, 2018 https://www.kaggle.com/zynicide/wine-reviews/home