I have choose this data from kaggle, in my data set there are 691 wine varieties among 18 countries grown in various areas of countries like state or specific valley regions. I have multilevel, where region is lower level, province is upper level and country is the most upper level. My dependent variable is price of wine bottle ranging from $7.00 to $775.00 and my independent variables are tasted of wine rated between 1-100, variety grown in states is “1” grown in specific valleys as “0”. My outcome is variable price, I want to see the wine bottle price based on taste, countries, regions grown. Here, i want to understand how price varies among various countries based on wine variety!
I will look at the wine bottle price based on wine variety grown in regions through complete pooling, based on countries with no pooling, and then use a linear mixed effect model to get a combination of pooling and no pooling.
library(readr)
library(nlme)
library(lme4)
library(dplyr)
library(magrittr)
library(tidyr)
library(haven)
library(lmerTest)
library(ggplot2)
library(texreg)
Wine<-read_csv("/Users/kanwallatif/Documents/winedata1.csv", col_names = TRUE) # Importing dataset file.
head(Wine)
## # A tibble: 6 x 15
## XX1 variety Variety_Region country province points price description
## <dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <chr>
## 1 1 Portug… 1 Portug… Douro 87 15 This is ri…
## 2 2 Pinot … 0 US Oregon 87 14 Tart and s…
## 3 3 Riesli… 0 US Michigan 87 13 Pineapple …
## 4 4 Pinot … 0 US Oregon 87 65 Much like …
## 5 5 Tempra… 0 Spain Norther… 87 15 Blackberry…
## 6 6 Frappa… 0 Italy Sicily … 87 16 Here's a b…
## # … with 7 more variables: designation <chr>, region_1 <chr>,
## # region_2 <chr>, taster_name <chr>, taster_twitter_handle <chr>,
## # title <chr>, winery <chr>
Wine$variety=as.factor(Wine$variety)
length(unique(Wine$country)) # How many countries!
## [1] 18
Wine %>%
group_by(country) %>%
summarise(n_vareity= n()) # How many wine variety in each country
## # A tibble: 18 x 2
## country n_vareity
## <chr> <int>
## 1 Argentina 22
## 2 Australia 19
## 3 Austria 16
## 4 Canada 1
## 5 Chile 35
## 6 France 84
## 7 Germany 17
## 8 Greece 3
## 9 Hungary 2
## 10 Israel 2
## 11 Italy 125
## 12 Mexico 2
## 13 New Zealand 7
## 14 Portugal 16
## 15 Romania 1
## 16 South Africa 18
## 17 Spain 21
## 18 US 300
My first step is to conduct ecological model, is the country level regression model. Here I am converting my unique wine variety data set to country level dataset. I am grouping by country and use summarize command to generate two new variables at country level.
wine2 <- Wine %>%
group_by(country) %>%
summarise(mean_p = mean(price, na.rm = TRUE), mean_s = mean(Variety_Region, na.rm = TRUE))
head(wine2)
## # A tibble: 6 x 3
## country mean_p mean_s
## <chr> <dbl> <dbl>
## 1 Argentina 32.3 0.0455
## 2 Australia 93.2 0
## 3 Austria 18.6 1
## 4 Canada 30 0
## 5 Chile 18.7 1
## 6 France 50.1 0.0119
INTERPRETATION:
mean_p is the average country level wine bottle price and mean_s gives country level proportion wine grown in region 1
ecoreg <- lm(mean_p ~ mean_s, data = wine2)
summary(ecoreg)
##
## Call:
## lm(formula = mean_p ~ mean_s, data = wine2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.089 -26.293 -13.453 4.239 131.911
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.19 16.67 2.590 0.0197 *
## mean_s 2.90 21.30 0.136 0.8934
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.68 on 16 degrees of freedom
## Multiple R-squared: 0.001158, Adjusted R-squared: -0.06127
## F-statistic: 0.01854 on 1 and 16 DF, p-value: 0.8934
INTERPRETATION:
At country level there seem to have a positive relationship between proportions of wine grown in region 1 and average wine bottle price.
cpooling <- lm(price ~ Variety_Region, data = Wine)
summary(cpooling)
##
## Call:
## lm(formula = price ~ Variety_Region, data = Wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.90 -22.38 -12.38 6.62 732.10
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.381 2.347 16.353 <2e-16 ***
## Variety_Region 4.523 4.651 0.973 0.331
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 53.26 on 689 degrees of freedom
## Multiple R-squared: 0.001371, Adjusted R-squared: -7.853e-05
## F-statistic: 0.9458 on 1 and 689 DF, p-value: 0.3311
INTERPRETATION:
This model is regardless of country level. The above results are showing that on an average in region1 price index for wine bottle is higher than region0 by $4.52
Here i am estimating one model for each country, i will have 18 models since number of countries are 18 in my dataset.
The Intercept:
I am extracting regression results and putting them into a dataset and drawing a histogram of the estimated intercept.
dcoef <- Wine %>%
group_by(country) %>%
do(mod = lm(price ~ Variety_Region, data = .))
coef <- dcoef %>% do(data.frame(intc = coef(.$mod)[1]))
ggplot(coef, aes(x = intc)) + geom_histogram(fill = "dark green")+xlab("Wine grown in Region0 over 18 countries")
INTERPRETATION:
Here i am looking at the distribution of country specific average wine bottle price of region 0 (wine grown in specif valleys) which ranges from 0 to 150. I see a dramatic between countries variation in the wine bottle price among region0 where wine is grown in valleys.
ggplot(coef, aes(x = intc)) + geom_density(fill="dark green")+xlab("Wine grown in Region0 over 18 countries")
The slop:
In slop everything remain same only variety_region is taking slop parameter for variety_region.
dcoef <- Wine %>%
group_by(country) %>%
do(mod = lm(price ~ Variety_Region, data = .))
coef <- dcoef %>% do(data.frame(Variety_Regionc = coef(.$mod)[2]))
ggplot(coef, aes(x = Variety_Regionc)) + geom_histogram(fill = "dark blue")+xlab("Region difference of wine bottle price among different countries.")
INTERPRETATION:
The above results showing the distribution of region difference of wine bottle price among different countries. Under negative range, in those countries on an average wine grown in valley regions have lower wine bottle price than wine grown in states and there is the vast majority of the countries in region0 with lower price. Negative range is greater than positive range this means less countries have on an average high wine bottle price.
ggplot(coef, aes(x = Variety_Regionc)) + geom_density(fill="dark blue")+xlab("Region difference of wine bottle price among different countries.")
The strength of partial pooling is it allows between countries variations.
Random intercept:
m1_lme <- lme(price ~ Variety_Region, data = Wine, random = ~1|country, method = "ML")
summary(m1_lme)
## Linear mixed-effects model fit by maximum likelihood
## Data: Wine
## AIC BIC logLik
## 7428.649 7446.802 -3710.325
##
## Random effects:
## Formula: ~1 | country
## (Intercept) Residual
## StdDev: 27.3376 50.84069
##
## Fixed effects: price ~ Variety_Region
## Value Std.Error DF t-value p-value
## (Intercept) 34.43845 8.640959 672 3.985490 0.0001
## Variety_Region 16.24517 6.758330 672 2.403725 0.0165
## Correlation:
## (Intr)
## Variety_Region -0.436
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -1.5422759 -0.3361056 -0.1578131 0.1556265 13.5243960
##
## Number of Observations: 691
## Number of Groups: 18
INTERPRETATION: Here i want intercept of my regression model to vary freely i.e to follow normal distribution between countries. 27.33 is the parameter that governs between country variation. In other words, 27.33 is the estimated standard deviation of the country level normal distribution. Above results also showing that for wine grown in region state there is on an average increase of price by 16.24
m2_lme <- lme(price ~ Variety_Region, data = Wine, random = ~ Variety_Region|country, method = "ML" , control=lmeControl(returnObject=TRUE))
summary(m2_lme)
## Linear mixed-effects model fit by maximum likelihood
## Data: Wine
## AIC BIC logLik
## 7431.918 7459.147 -3709.959
##
## Random effects:
## Formula: ~Variety_Region | country
## Structure: General positive-definite, Log-Cholesky parametrization
## StdDev Corr
## (Intercept) 21.53952 (Intr)
## Variety_Region 10.70701 0.999
## Residual 50.81615
##
## Fixed effects: price ~ Variety_Region
## Value Std.Error DF t-value p-value
## (Intercept) 36.01050 6.787248 672 5.305612 0.00
## Variety_Region 15.73422 6.748872 672 2.331385 0.02
## Correlation:
## (Intr)
## Variety_Region -0.011
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -1.5818237 -0.3424760 -0.1525249 0.1546749 13.4921232
##
## Number of Observations: 691
## Number of Groups: 18
INTERPRETATION:
In the above results i have standard deviation for intercept 21.53 and for variety_region 10.7. Correlation between these two standard deviation is negative -0.11 that means in countries where wine is grown in region1 price of wine bottle is high, the region difference in price index seem to be low.
m3_lme <- lme(price ~ Variety_Region + points, data = Wine, random = ~ Variety_Region|country, method = "ML", control=lmeControl(returnObject=TRUE))
summary(m3_lme)
## Linear mixed-effects model fit by maximum likelihood
## Data: Wine
## AIC BIC logLik
## 7266.219 7297.985 -3626.109
##
## Random effects:
## Formula: ~Variety_Region | country
## Structure: General positive-definite, Log-Cholesky parametrization
## StdDev Corr
## (Intercept) 4.524243 (Intr)
## Variety_Region 21.548148 -1
## Residual 45.570090
##
## Fixed effects: price ~ Variety_Region + points
## Value Std.Error DF t-value p-value
## (Intercept) -763.5859 57.21551 671 -13.345784 0.0000
## Variety_Region -0.9589 7.76242 671 -0.123533 0.9017
## points 9.0494 0.64295 671 14.074911 0.0000
## Correlation:
## (Intr) Vrty_R
## Variety_Region -0.007
## points -0.999 -0.021
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -1.53289963 -0.41050253 -0.09077991 0.19666396 13.97515568
##
## Number of Observations: 691
## Number of Groups: 18
INTERPRETATION:
The model above states that the fixed affect among countries, the average price of wine bottle increase by 9.04 as points for wine tastes increases i.e people rank high for wine taste, price for wine bottle increases.
# anova(cpooling, m1_lme, m2_lme, m3_lme, m4_lme)
htmlreg(list(cpooling, m1_lme, m2_lme, m3_lme))
| Model 1 | Model 2 | Model 3 | Model 4 | ||
|---|---|---|---|---|---|
| (Intercept) | 38.38*** | 34.44*** | 36.01*** | -763.59*** | |
| (2.35) | (8.64) | (6.79) | (57.22) | ||
| Variety_Region | 4.52 | 16.25* | 15.73* | -0.96 | |
| (4.65) | (6.76) | (6.75) | (7.76) | ||
| points | 9.05*** | ||||
| (0.64) | |||||
| R2 | 0.00 | ||||
| Adj. R2 | -0.00 | ||||
| Num. obs. | 691 | 691 | 691 | 691 | |
| RMSE | 53.26 | ||||
| AIC | 7428.65 | 7431.92 | 7266.22 | ||
| BIC | 7446.80 | 7459.15 | 7297.99 | ||
| Log Likelihood | -3710.32 | -3709.96 | -3626.11 | ||
| Num. groups | 18 | 18 | 18 | ||
| p < 0.001, p < 0.01, p < 0.05 | |||||
INTERPRETATION:
The best model is 3rd model, the partial pooling model with two independent variables as it fits the dataset best.
Is wine bottle being priced on country level or region level!
m4_lme <- lme(price ~ Variety_Region*points, data = Wine, random = ~ Variety_Region|country, method = "ML", control=lmeControl(returnObject=TRUE))
summary(m4_lme)
## Linear mixed-effects model fit by maximum likelihood
## Data: Wine
## AIC BIC logLik
## 7262.867 7299.172 -3623.433
##
## Random effects:
## Formula: ~Variety_Region | country
## Structure: General positive-definite, Log-Cholesky parametrization
## StdDev Corr
## (Intercept) 5.352814 (Intr)
## Variety_Region 20.490746 -1
## Residual 45.415662
##
## Fixed effects: price ~ Variety_Region * points
## Value Std.Error DF t-value p-value
## (Intercept) -692.0714 64.53834 670 -10.723416 0.0000
## Variety_Region -326.4368 138.85311 670 -2.350951 0.0190
## points 8.2508 0.72525 670 11.376574 0.0000
## Variety_Region:points 3.6514 1.55681 670 2.345439 0.0193
## Correlation:
## (Intr) Vrty_R points
## Variety_Region -0.465
## points -0.999 0.464
## Variety_Region:points 0.465 -0.999 -0.466
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -1.5400089 -0.3964931 -0.0910068 0.1927289 13.6438838
##
## Number of Observations: 691
## Number of Groups: 18
INTERPRETATION:
The interaction model is used to demonstrate the correlation of the interaction between regions and points given by person on wine taste towards wine bottle taste among countries. The fixed effect of this model states that among countries, the average price of wine bottle increases by 3.65 for every unit increase in interaction between region and points for wine taste. In other words an increase in points for wine taste increases the region difference by 3.65
# anova(cpooling, m1_lme, m2_lme, m3_lme, m4_lme)
htmlreg(list(cpooling, m1_lme, m2_lme, m3_lme, m4_lme))
| Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | ||
|---|---|---|---|---|---|---|
| (Intercept) | 38.38*** | 34.44*** | 36.01*** | -763.59*** | -692.07*** | |
| (2.35) | (8.64) | (6.79) | (57.22) | (64.54) | ||
| Variety_Region | 4.52 | 16.25* | 15.73* | -0.96 | -326.44* | |
| (4.65) | (6.76) | (6.75) | (7.76) | (138.85) | ||
| points | 9.05*** | 8.25*** | ||||
| (0.64) | (0.73) | |||||
| Variety_Region:points | 3.65* | |||||
| (1.56) | ||||||
| R2 | 0.00 | |||||
| Adj. R2 | -0.00 | |||||
| Num. obs. | 691 | 691 | 691 | 691 | 691 | |
| RMSE | 53.26 | |||||
| AIC | 7428.65 | 7431.92 | 7266.22 | 7262.87 | ||
| BIC | 7446.80 | 7459.15 | 7297.99 | 7299.17 | ||
| Log Likelihood | -3710.32 | -3709.96 | -3626.11 | -3623.43 | ||
| Num. groups | 18 | 18 | 18 | 18 | ||
| p < 0.001, p < 0.01, p < 0.05 | ||||||
INTERPRETATION:
The best model is 4th model as it fits the data best best.
Wine %<>% mutate(cpoints = points - mean(points))
m4a_lme <- lme(price ~ Variety_Region*cpoints, data = Wine, random = ~ Variety_Region|country, method = "ML", control=lmeControl(returnObject=TRUE))
summary(m4a_lme)
## Linear mixed-effects model fit by maximum likelihood
## Data: Wine
## AIC BIC logLik
## 7262.867 7299.172 -3623.433
##
## Random effects:
## Formula: ~Variety_Region | country
## Structure: General positive-definite, Log-Cholesky parametrization
## StdDev Corr
## (Intercept) 5.346892 (Intr)
## Variety_Region 20.490861 -1
## Residual 45.415675
##
## Fixed effects: price ~ Variety_Region * cpoints
## Value Std.Error DF t-value p-value
## (Intercept) 40.88942 2.821900 670 14.490031 0.0000
## Variety_Region -2.06345 7.485856 670 -0.275647 0.7829
## cpoints 8.25110 0.725236 670 11.377119 0.0000
## Variety_Region:cpoints 3.65094 1.556827 670 2.345118 0.0193
## Correlation:
## (Intr) Vrty_R cponts
## Variety_Region -0.669
## cpoints -0.017 0.007
## Variety_Region:cpoints 0.001 -0.047 -0.466
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -1.54010915 -0.39653542 -0.09106246 0.19275191 13.64380567
##
## Number of Observations: 691
## Number of Groups: 18
INTERPRETATION:
It gave same results except estimated parameters for region and intercepts.
htmlreg(list(m4_lme, m4a_lme), doctype = FALSE)
| Model 1 | Model 2 | ||
|---|---|---|---|
| (Intercept) | -692.07*** | 40.89*** | |
| (64.54) | (2.82) | ||
| Variety_Region | -326.44* | -2.06 | |
| (138.85) | (7.49) | ||
| points | 8.25*** | ||
| (0.73) | |||
| Variety_Region:points | 3.65* | ||
| (1.56) | |||
| cpoints | 8.25*** | ||
| (0.73) | |||
| Variety_Region:cpoints | 3.65* | ||
| (1.56) | |||
| AIC | 7262.87 | 7262.87 | |
| BIC | 7299.17 | 7299.17 | |
| Log Likelihood | -3623.43 | -3623.43 | |
| Num. obs. | 691 | 691 | |
| Num. groups | 18 | 18 | |
| p < 0.001, p < 0.01, p < 0.05 | |||
m0_lme <- lme(price ~ 1, random = ~ 1|country, data = Wine, method = "ML")
summary(m0_lme)
## Linear mixed-effects model fit by maximum likelihood
## Data: Wine
## AIC BIC logLik
## 7432.278 7445.892 -3713.139
##
## Random effects:
## Formula: ~1 | country
## (Intercept) Residual
## StdDev: 24.78161 51.14146
##
## Fixed effects: price ~ 1
## Value Std.Error DF t-value p-value
## (Intercept) 43.26867 7.205913 673 6.004607 0
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -1.4771289 -0.3547233 -0.1591872 0.1536705 13.5009345
##
## Number of Observations: 691
## Number of Groups: 18
Intra-class correlation (ICC): 24.78161/(24.78161+51.14146)=0.32640
About 32.64% of the total variation in wine bottle price may be attributed to country-level pricing. The other 67.36% can be attributed to the region level.
intervals(m0_lme)
## Approximate 95% confidence intervals
##
## Fixed effects:
## lower est. upper
## (Intercept) 29.13014 43.26867 57.40721
## attr(,"label")
## [1] "Fixed effects:"
##
## Random Effects:
## Level: country
## lower est. upper
## sd((Intercept)) 14.51986 24.78161 42.29575
##
## Within-group standard error:
## lower est. upper
## 48.47572 51.14146 53.95379
It has been seen from above Ecological analysis that country level there seem to have a positive relationship between proportions of wine grown in region 1 and average wine bottle price. Under complete pooling, model is regardless of country level. It showed that on an average in region1 price index for wine bottle is higher than region0 by $4.52 Under no polling analysis, for intercept i looked at the distribution of country specific average wine bottle price of region 0 (wine grown in specif valleys), I see a dramatic between countries variation in the wine bottle price among region0 where wine is grown in valleys. For slope, showed the distribution of region difference of wine bottle price among different countries. Under negative range, in those countries on an average wine grown in valley regions have lower wine bottle price than wine grown in states and there is the vast majority of the countries in region0 with lower price.Partial pooling results showed that for wine grown in region state there is on an average increase of price by 16.24 Under multilevel analysis in countries where wine is grown in region1 price of wine bottle is high, the region difference in price index seem to be low. The average price of wine bottle increase by 9.04 as points for wine tastes increases i.e people rank high for wine taste, price for wine bottle increases.