I have choose this data from kaggle, in my data set there are 691 wine varieties among 18 countries grown in various areas of countries like state or specific valley regions. I have multilevel, where region is lower level, province is upper level and country is the most upper level. My dependent variable is price of wine bottle ranging from $7.00 to $775.00 and my independent variables are tasted of wine rated between 1-100, variety grown in states is “1” grown in specific valleys as “0”. My outcome is variable price, I want to see the wine bottle price based on taste, countries, regions grown. Here, i want to understand how price varies among various countries based on wine variety!

I will look at the wine bottle price based on wine variety grown in regions through complete pooling, based on countries with no pooling, and then use a linear mixed effect model to get a combination of pooling and no pooling.

library(readr)
library(nlme)
library(lme4)
library(dplyr)
library(magrittr)
library(tidyr)
library(haven)
library(lmerTest)
library(ggplot2)
library(texreg)
Wine<-read_csv("/Users/kanwallatif/Documents/winedata1.csv", col_names = TRUE) # Importing dataset file.
head(Wine)
## # A tibble: 6 x 15
##     XX1 variety Variety_Region country province points price description
##   <dbl> <chr>            <dbl> <chr>   <chr>     <dbl> <dbl> <chr>      
## 1     1 Portug…              1 Portug… Douro        87    15 This is ri…
## 2     2 Pinot …              0 US      Oregon       87    14 Tart and s…
## 3     3 Riesli…              0 US      Michigan     87    13 Pineapple …
## 4     4 Pinot …              0 US      Oregon       87    65 Much like …
## 5     5 Tempra…              0 Spain   Norther…     87    15 Blackberry…
## 6     6 Frappa…              0 Italy   Sicily …     87    16 Here's a b…
## # … with 7 more variables: designation <chr>, region_1 <chr>,
## #   region_2 <chr>, taster_name <chr>, taster_twitter_handle <chr>,
## #   title <chr>, winery <chr>
Wine$variety=as.factor(Wine$variety)

Number of countries:

length(unique(Wine$country)) # How many countries!
## [1] 18

Wine Variety in each country:

Wine %>% 
  group_by(country) %>% 
  summarise(n_vareity= n()) # How many wine variety in each country
## # A tibble: 18 x 2
##    country      n_vareity
##    <chr>            <int>
##  1 Argentina           22
##  2 Australia           19
##  3 Austria             16
##  4 Canada               1
##  5 Chile               35
##  6 France              84
##  7 Germany             17
##  8 Greece               3
##  9 Hungary              2
## 10 Israel               2
## 11 Italy              125
## 12 Mexico               2
## 13 New Zealand          7
## 14 Portugal            16
## 15 Romania              1
## 16 South Africa        18
## 17 Spain               21
## 18 US                 300

Ecological Regression Model(country level analysis):

My first step is to conduct ecological model, is the country level regression model. Here I am converting my unique wine variety data set to country level dataset. I am grouping by country and use summarize command to generate two new variables at country level.

wine2 <- Wine %>% 
  group_by(country) %>% 
  summarise(mean_p = mean(price, na.rm = TRUE), mean_s = mean(Variety_Region, na.rm = TRUE))
head(wine2)
## # A tibble: 6 x 3
##   country   mean_p mean_s
##   <chr>      <dbl>  <dbl>
## 1 Argentina   32.3 0.0455
## 2 Australia   93.2 0     
## 3 Austria     18.6 1     
## 4 Canada      30   0     
## 5 Chile       18.7 1     
## 6 France      50.1 0.0119

INTERPRETATION:
mean_p is the average country level wine bottle price and mean_s gives country level proportion wine grown in region 1

ecoreg <- lm(mean_p ~ mean_s, data = wine2)
summary(ecoreg)
## 
## Call:
## lm(formula = mean_p ~ mean_s, data = wine2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.089 -26.293 -13.453   4.239 131.911 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    43.19      16.67   2.590   0.0197 *
## mean_s          2.90      21.30   0.136   0.8934  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.68 on 16 degrees of freedom
## Multiple R-squared:  0.001158,   Adjusted R-squared:  -0.06127 
## F-statistic: 0.01854 on 1 and 16 DF,  p-value: 0.8934

INTERPRETATION:
At country level there seem to have a positive relationship between proportions of wine grown in region 1 and average wine bottle price.

Complete-pooling model(Region level analysis):

cpooling <- lm(price ~ Variety_Region, data = Wine)
summary(cpooling)
## 
## Call:
## lm(formula = price ~ Variety_Region, data = Wine)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -35.90 -22.38 -12.38   6.62 732.10 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      38.381      2.347  16.353   <2e-16 ***
## Variety_Region    4.523      4.651   0.973    0.331    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 53.26 on 689 degrees of freedom
## Multiple R-squared:  0.001371,   Adjusted R-squared:  -7.853e-05 
## F-statistic: 0.9458 on 1 and 689 DF,  p-value: 0.3311

INTERPRETATION:
This model is regardless of country level. The above results are showing that on an average in region1 price index for wine bottle is higher than region0 by $4.52

No-pooling Model:

Here i am estimating one model for each country, i will have 18 models since number of countries are 18 in my dataset.

The Intercept:
I am extracting regression results and putting them into a dataset and drawing a histogram of the estimated intercept.

dcoef <- Wine %>% 
    group_by(country) %>% 
    do(mod = lm(price ~ Variety_Region, data = .))
coef <- dcoef %>% do(data.frame(intc = coef(.$mod)[1]))
ggplot(coef, aes(x = intc)) + geom_histogram(fill = "dark green")+xlab("Wine grown in Region0 over 18 countries")

INTERPRETATION:
Here i am looking at the distribution of country specific average wine bottle price of region 0 (wine grown in specif valleys) which ranges from 0 to 150. I see a dramatic between countries variation in the wine bottle price among region0 where wine is grown in valleys.

ggplot(coef, aes(x = intc)) + geom_density(fill="dark green")+xlab("Wine grown in Region0 over 18 countries")

The slop:
In slop everything remain same only variety_region is taking slop parameter for variety_region.

dcoef <- Wine %>% 
    group_by(country) %>% 
    do(mod = lm(price ~ Variety_Region, data = .))
coef <- dcoef %>% do(data.frame(Variety_Regionc = coef(.$mod)[2]))
ggplot(coef, aes(x = Variety_Regionc)) + geom_histogram(fill = "dark blue")+xlab("Region difference of wine bottle price among different countries.")

INTERPRETATION:
The above results showing the distribution of region difference of wine bottle price among different countries. Under negative range, in those countries on an average wine grown in valley regions have lower wine bottle price than wine grown in states and there is the vast majority of the countries in region0 with lower price. Negative range is greater than positive range this means less countries have on an average high wine bottle price.

ggplot(coef, aes(x = Variety_Regionc)) + geom_density(fill="dark blue")+xlab("Region difference of wine bottle price among different countries.")

Partial-pooling:

The strength of partial pooling is it allows between countries variations.

Random intercept:

m1_lme <- lme(price ~ Variety_Region, data = Wine, random = ~1|country, method = "ML")
summary(m1_lme)
## Linear mixed-effects model fit by maximum likelihood
##  Data: Wine 
##        AIC      BIC    logLik
##   7428.649 7446.802 -3710.325
## 
## Random effects:
##  Formula: ~1 | country
##         (Intercept) Residual
## StdDev:     27.3376 50.84069
## 
## Fixed effects: price ~ Variety_Region 
##                   Value Std.Error  DF  t-value p-value
## (Intercept)    34.43845  8.640959 672 3.985490  0.0001
## Variety_Region 16.24517  6.758330 672 2.403725  0.0165
##  Correlation: 
##                (Intr)
## Variety_Region -0.436
## 
## Standardized Within-Group Residuals:
##        Min         Q1        Med         Q3        Max 
## -1.5422759 -0.3361056 -0.1578131  0.1556265 13.5243960 
## 
## Number of Observations: 691
## Number of Groups: 18

INTERPRETATION: Here i want intercept of my regression model to vary freely i.e to follow normal distribution between countries. 27.33 is the parameter that governs between country variation. In other words, 27.33 is the estimated standard deviation of the country level normal distribution. Above results also showing that for wine grown in region state there is on an average increase of price by 16.24

Multilevel model #2

m2_lme <- lme(price ~ Variety_Region, data = Wine, random = ~ Variety_Region|country, method = "ML" , control=lmeControl(returnObject=TRUE))
summary(m2_lme)
## Linear mixed-effects model fit by maximum likelihood
##  Data: Wine 
##        AIC      BIC    logLik
##   7431.918 7459.147 -3709.959
## 
## Random effects:
##  Formula: ~Variety_Region | country
##  Structure: General positive-definite, Log-Cholesky parametrization
##                StdDev   Corr  
## (Intercept)    21.53952 (Intr)
## Variety_Region 10.70701 0.999 
## Residual       50.81615       
## 
## Fixed effects: price ~ Variety_Region 
##                   Value Std.Error  DF  t-value p-value
## (Intercept)    36.01050  6.787248 672 5.305612    0.00
## Variety_Region 15.73422  6.748872 672 2.331385    0.02
##  Correlation: 
##                (Intr)
## Variety_Region -0.011
## 
## Standardized Within-Group Residuals:
##        Min         Q1        Med         Q3        Max 
## -1.5818237 -0.3424760 -0.1525249  0.1546749 13.4921232 
## 
## Number of Observations: 691
## Number of Groups: 18

INTERPRETATION:
In the above results i have standard deviation for intercept 21.53 and for variety_region 10.7. Correlation between these two standard deviation is negative -0.11 that means in countries where wine is grown in region1 price of wine bottle is high, the region difference in price index seem to be low.

Adding Price of bottle at Country-level covariate:

m3_lme <- lme(price ~ Variety_Region + points, data = Wine, random = ~ Variety_Region|country, method = "ML", control=lmeControl(returnObject=TRUE))
summary(m3_lme)
## Linear mixed-effects model fit by maximum likelihood
##  Data: Wine 
##        AIC      BIC    logLik
##   7266.219 7297.985 -3626.109
## 
## Random effects:
##  Formula: ~Variety_Region | country
##  Structure: General positive-definite, Log-Cholesky parametrization
##                StdDev    Corr  
## (Intercept)     4.524243 (Intr)
## Variety_Region 21.548148 -1    
## Residual       45.570090       
## 
## Fixed effects: price ~ Variety_Region + points 
##                    Value Std.Error  DF    t-value p-value
## (Intercept)    -763.5859  57.21551 671 -13.345784  0.0000
## Variety_Region   -0.9589   7.76242 671  -0.123533  0.9017
## points            9.0494   0.64295 671  14.074911  0.0000
##  Correlation: 
##                (Intr) Vrty_R
## Variety_Region -0.007       
## points         -0.999 -0.021
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -1.53289963 -0.41050253 -0.09077991  0.19666396 13.97515568 
## 
## Number of Observations: 691
## Number of Groups: 18

INTERPRETATION:
The model above states that the fixed affect among countries, the average price of wine bottle increase by 9.04 as points for wine tastes increases i.e people rank high for wine taste, price for wine bottle increases.

Best Model

# anova(cpooling, m1_lme, m2_lme, m3_lme, m4_lme)
htmlreg(list(cpooling, m1_lme, m2_lme, m3_lme))
Statistical models
Model 1 Model 2 Model 3 Model 4
(Intercept) 38.38*** 34.44*** 36.01*** -763.59***
(2.35) (8.64) (6.79) (57.22)
Variety_Region 4.52 16.25* 15.73* -0.96
(4.65) (6.76) (6.75) (7.76)
points 9.05***
(0.64)
R2 0.00
Adj. R2 -0.00
Num. obs. 691 691 691 691
RMSE 53.26
AIC 7428.65 7431.92 7266.22
BIC 7446.80 7459.15 7297.99
Log Likelihood -3710.32 -3709.96 -3626.11
Num. groups 18 18 18
p < 0.001, p < 0.01, p < 0.05

INTERPRETATION:
The best model is 3rd model, the partial pooling model with two independent variables as it fits the dataset best.

Cross-level interaction:

Is wine bottle being priced on country level or region level!

m4_lme <- lme(price ~ Variety_Region*points, data = Wine, random = ~ Variety_Region|country, method = "ML", control=lmeControl(returnObject=TRUE))
summary(m4_lme)
## Linear mixed-effects model fit by maximum likelihood
##  Data: Wine 
##        AIC      BIC    logLik
##   7262.867 7299.172 -3623.433
## 
## Random effects:
##  Formula: ~Variety_Region | country
##  Structure: General positive-definite, Log-Cholesky parametrization
##                StdDev    Corr  
## (Intercept)     5.352814 (Intr)
## Variety_Region 20.490746 -1    
## Residual       45.415662       
## 
## Fixed effects: price ~ Variety_Region * points 
##                           Value Std.Error  DF    t-value p-value
## (Intercept)           -692.0714  64.53834 670 -10.723416  0.0000
## Variety_Region        -326.4368 138.85311 670  -2.350951  0.0190
## points                   8.2508   0.72525 670  11.376574  0.0000
## Variety_Region:points    3.6514   1.55681 670   2.345439  0.0193
##  Correlation: 
##                       (Intr) Vrty_R points
## Variety_Region        -0.465              
## points                -0.999  0.464       
## Variety_Region:points  0.465 -0.999 -0.466
## 
## Standardized Within-Group Residuals:
##        Min         Q1        Med         Q3        Max 
## -1.5400089 -0.3964931 -0.0910068  0.1927289 13.6438838 
## 
## Number of Observations: 691
## Number of Groups: 18

INTERPRETATION:

The interaction model is used to demonstrate the correlation of the interaction between regions and points given by person on wine taste towards wine bottle taste among countries. The fixed effect of this model states that among countries, the average price of wine bottle increases by 3.65 for every unit increase in interaction between region and points for wine taste. In other words an increase in points for wine taste increases the region difference by 3.65

Best model:

# anova(cpooling, m1_lme, m2_lme, m3_lme, m4_lme)
htmlreg(list(cpooling, m1_lme, m2_lme, m3_lme, m4_lme))
Statistical models
Model 1 Model 2 Model 3 Model 4 Model 5
(Intercept) 38.38*** 34.44*** 36.01*** -763.59*** -692.07***
(2.35) (8.64) (6.79) (57.22) (64.54)
Variety_Region 4.52 16.25* 15.73* -0.96 -326.44*
(4.65) (6.76) (6.75) (7.76) (138.85)
points 9.05*** 8.25***
(0.64) (0.73)
Variety_Region:points 3.65*
(1.56)
R2 0.00
Adj. R2 -0.00
Num. obs. 691 691 691 691 691
RMSE 53.26
AIC 7428.65 7431.92 7266.22 7262.87
BIC 7446.80 7459.15 7297.99 7299.17
Log Likelihood -3710.32 -3709.96 -3626.11 -3623.43
Num. groups 18 18 18 18
p < 0.001, p < 0.01, p < 0.05

INTERPRETATION:
The best model is 4th model as it fits the data best best.

Centering covariates:

Wine %<>% mutate(cpoints = points - mean(points))

Refit the model:

m4a_lme <- lme(price ~ Variety_Region*cpoints, data = Wine, random = ~ Variety_Region|country, method = "ML", control=lmeControl(returnObject=TRUE))
summary(m4a_lme)
## Linear mixed-effects model fit by maximum likelihood
##  Data: Wine 
##        AIC      BIC    logLik
##   7262.867 7299.172 -3623.433
## 
## Random effects:
##  Formula: ~Variety_Region | country
##  Structure: General positive-definite, Log-Cholesky parametrization
##                StdDev    Corr  
## (Intercept)     5.346892 (Intr)
## Variety_Region 20.490861 -1    
## Residual       45.415675       
## 
## Fixed effects: price ~ Variety_Region * cpoints 
##                           Value Std.Error  DF   t-value p-value
## (Intercept)            40.88942  2.821900 670 14.490031  0.0000
## Variety_Region         -2.06345  7.485856 670 -0.275647  0.7829
## cpoints                 8.25110  0.725236 670 11.377119  0.0000
## Variety_Region:cpoints  3.65094  1.556827 670  2.345118  0.0193
##  Correlation: 
##                        (Intr) Vrty_R cponts
## Variety_Region         -0.669              
## cpoints                -0.017  0.007       
## Variety_Region:cpoints  0.001 -0.047 -0.466
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -1.54010915 -0.39653542 -0.09106246  0.19275191 13.64380567 
## 
## Number of Observations: 691
## Number of Groups: 18

INTERPRETATION:
It gave same results except estimated parameters for region and intercepts.

Best model:

htmlreg(list(m4_lme, m4a_lme), doctype = FALSE)
Statistical models
Model 1 Model 2
(Intercept) -692.07*** 40.89***
(64.54) (2.82)
Variety_Region -326.44* -2.06
(138.85) (7.49)
points 8.25***
(0.73)
Variety_Region:points 3.65*
(1.56)
cpoints 8.25***
(0.73)
Variety_Region:cpoints 3.65*
(1.56)
AIC 7262.87 7262.87
BIC 7299.17 7299.17
Log Likelihood -3623.43 -3623.43
Num. obs. 691 691
Num. groups 18 18
p < 0.001, p < 0.01, p < 0.05

Intra-class correlation:

m0_lme <- lme(price ~ 1, random = ~ 1|country, data = Wine, method = "ML")
summary(m0_lme)
## Linear mixed-effects model fit by maximum likelihood
##  Data: Wine 
##        AIC      BIC    logLik
##   7432.278 7445.892 -3713.139
## 
## Random effects:
##  Formula: ~1 | country
##         (Intercept) Residual
## StdDev:    24.78161 51.14146
## 
## Fixed effects: price ~ 1 
##                Value Std.Error  DF  t-value p-value
## (Intercept) 43.26867  7.205913 673 6.004607       0
## 
## Standardized Within-Group Residuals:
##        Min         Q1        Med         Q3        Max 
## -1.4771289 -0.3547233 -0.1591872  0.1536705 13.5009345 
## 
## Number of Observations: 691
## Number of Groups: 18

Intra-class correlation (ICC): 24.78161/(24.78161+51.14146)=0.32640
About 32.64% of the total variation in wine bottle price may be attributed to country-level pricing. The other 67.36% can be attributed to the region level.

Computing confidence intervals:

intervals(m0_lme)
## Approximate 95% confidence intervals
## 
##  Fixed effects:
##                lower     est.    upper
## (Intercept) 29.13014 43.26867 57.40721
## attr(,"label")
## [1] "Fixed effects:"
## 
##  Random Effects:
##   Level: country 
##                    lower     est.    upper
## sd((Intercept)) 14.51986 24.78161 42.29575
## 
##  Within-group standard error:
##    lower     est.    upper 
## 48.47572 51.14146 53.95379

Conclusion:

It has been seen from above Ecological analysis that country level there seem to have a positive relationship between proportions of wine grown in region 1 and average wine bottle price. Under complete pooling, model is regardless of country level. It showed that on an average in region1 price index for wine bottle is higher than region0 by $4.52 Under no polling analysis, for intercept i looked at the distribution of country specific average wine bottle price of region 0 (wine grown in specif valleys), I see a dramatic between countries variation in the wine bottle price among region0 where wine is grown in valleys. For slope, showed the distribution of region difference of wine bottle price among different countries. Under negative range, in those countries on an average wine grown in valley regions have lower wine bottle price than wine grown in states and there is the vast majority of the countries in region0 with lower price.Partial pooling results showed that for wine grown in region state there is on an average increase of price by 16.24 Under multilevel analysis in countries where wine is grown in region1 price of wine bottle is high, the region difference in price index seem to be low. The average price of wine bottle increase by 9.04 as points for wine tastes increases i.e people rank high for wine taste, price for wine bottle increases.