Week Eight: How does bean type, Cocoa percentage and company location affect Chocolate bar rating in the USA? Does the most popular cocoa bean make a quality chocolate bar?

Data

The dataset I used for this analysis was found on kaggle.com with 1795 observations. Variables that will be focused on will be where the cocoa bean orginated from, where the company is located, and the rating of the chocolate bar itself on a scale from 0-5. I also wanted to focus on one type of bean which is the more common one being used in the production of chocolate bars, Forastero and changed that variable into a binomial variable, 1 if it used , even partially, and 0 if it did not. I assigned a unique number to each different observation in the company loaction variable. There is a total of 60 different unique observations in that one variable.

library(nlme)
library(dplyr)
library(magrittr)
library(tidyr)
library(haven)
library(lmerTest)
library(ggplot2)
library(texreg)

chocolate<- read.csv("C:/Users/Jessica/Desktop/712/chocolate.csv")


head(chocolate)

##    company  REF review_date cocoa_dec cocoa_perc company_location rating
## 1 A. Morin 1876        2016      0.63         63           France   3.75
## 2 A. Morin 1676        2015      0.70         70           France   2.75
## 3 A. Morin 1676        2015      0.70         70           France   3.00
## 4 A. Morin 1680        2015      0.70         70           France   3.50
## 5 A. Morin 1704        2015      0.70         70           France   3.50
## 6 A. Morin 1315        2014      0.70         70           France   2.75
##   bean_type    origin
## 1     other  Sao Tome
## 2     other      Togo
## 3     other      Togo
## 4     other      Togo
## 5     other      Peru
## 6   Criollo Venezuela

chocolate<-chocolate%>%
  mutate(bean_type=ifelse(bean_type =="Forastero", "Forastero","other"))
chocolate<-chocolate%>%
  mutate(bean_type=ifelse(bean_type == "Forastero" , 1, 0))


length(unique(chocolate$company_location))

## [1] 60

chocolate %>% 
  group_by(company_location) %>% 
  summarise(n_sch = n())

## # A tibble: 60 x 2
##    company_location n_sch
##    <fct>            <int>
##  1 Amsterdam            4
##  2 Argentina            9
##  3 Australia           49
##  4 Austria             26
##  5 Belgium             40
##  6 Bolivia              2
##  7 Brazil              17
##  8 Canada             125
##  9 Chile                2
## 10 Colombia            23
## # ... with 50 more rows

Ecological Analysis

Here I ran an ecological analysis with each individual country, their average rating of chocolate and whether it used the Forastero bean.

Here we see a negative proportion between the forastero bean and the rating score. A non-forastero bean has a higher scoring rating than the forastero bean.

usa <- chocolate %>% 
  group_by(company_location) %>% 
  summarise(mean_p = mean(rating, na.rm = TRUE), mean_s = mean(bean_type, na.rm = TRUE))
head(usa)

## # A tibble: 6 x 3
##   company_location mean_p mean_s
##   <fct>             <dbl>  <dbl>
## 1 Amsterdam          3.5  0.25  
## 2 Argentina          3.31 0.444 
## 3 Australia          3.36 0.0816
## 4 Austria            3.24 0.154 
## 5 Belgium            3.09 0.2   
## 6 Bolivia            3.25 0

bean <- lm(mean_p ~ mean_s, data = usa)
summary(bean)

## 
## Call:
## lm(formula = mean_p ~ mean_s, data = usa)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.64219 -0.13000  0.05544  0.18868  0.60781 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.14219    0.03857  81.477   <2e-16 ***
## mean_s      -0.06955    0.13831  -0.503    0.617    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2586 on 58 degrees of freedom
## Multiple R-squared:  0.004341,   Adjusted R-squared:  -0.01283 
## F-statistic: 0.2528 on 1 and 58 DF,  p-value: 0.617

Complete-pooling model

Here I ran the complete pooling model, treating each company location as the same with no difference. Because there are 60 different locations used in the variable, there will be 60 different regression models, one for each location.

The results of 3.19 and -0.17 for the intercept and bean type shows that there are many countervailing forces.

cpooling <- lm(rating ~ bean_type, data = chocolate)
summary(cpooling)

## 
## Call:
## lm(formula = rating ~ bean_type, data = chocolate)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.19408 -0.28234  0.05592  0.30592  1.80592 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.19408    0.01196 267.155   <2e-16 ***
## bean_type   -0.07348    0.03591  -2.046   0.0409 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4776 on 1793 degrees of freedom
## Multiple R-squared:  0.00233,    Adjusted R-squared:  0.001773 
## F-statistic: 4.187 on 1 and 1793 DF,  p-value: 0.04088

The Intercept

The graph shows that there is a difference of rating scores between each country. The range shown is the country specific average rating for non-forastero beans ranging from 2.5-4.

dcoef <- chocolate %>% 
  group_by(company_location) %>% 
  do(mod = lm(rating ~ bean_type, data = .))
coef <- dcoef %>% do(data.frame(intc = coef(.$mod)[1]))
ggplot(coef, aes(x = intc)) + geom_histogram()

Slope

The distribution in Forastero bean difference shows that non-forastero beans have a higher rating.

dcoef <- chocolate %>% 
  group_by(company_location) %>% 
  do(mod = lm(rating ~ bean_type, data = .))
coef <- dcoef %>% do(data.frame(sexc = coef(.$mod)[2]))
ggplot(coef, aes(x = sexc)) + geom_histogram()

Random Intercept

m1_lme <- lme(rating ~ bean_type, data = chocolate, random = ~1|company_location, method = "ML")
summary(m1_lme)

## Linear mixed-effects model fit by maximum likelihood
##  Data: chocolate 
##        AIC      BIC    logLik
##   2427.166 2449.137 -1209.583
## 
## Random effects:
##  Formula: ~1 | company_location
##         (Intercept)  Residual
## StdDev:  0.08808265 0.4714271
## 
## Fixed effects: rating ~ bean_type 
##                 Value  Std.Error   DF   t-value p-value
## (Intercept)  3.200117 0.02252802 1734 142.05052  0.0000
## bean_type   -0.066493 0.03668728 1734  -1.81243  0.0701
##  Correlation: 
##           (Intr)
## bean_type -0.223
## 
## Standardized Within-Group Residuals:
##        Min         Q1        Med         Q3        Max 
## -4.5518922 -0.6483808  0.1058487  0.7200120  3.6244885 
## 
## Number of Observations: 1795
## Number of Groups: 60

Multilevel model #2

.07 estimated standard deviation of the theoretical normal distriubtion which covers the between company location change in the rating country change. .06 is the bean type slope..647 shows the between the two random components, in other words where the non-foraster bean is rated high, the bean type difference rating is also high.

m2_lme <- lme(rating ~ bean_type, data = chocolate, random = ~ bean_type|company_location, method = "ML", control = lmeControl(opt = 'optim'))
summary(m2_lme)

## Linear mixed-effects model fit by maximum likelihood
##  Data: chocolate 
##        AIC      BIC    logLik
##   2429.655 2462.612 -1208.828
## 
## Random effects:
##  Formula: ~bean_type | company_location
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev     Corr  
## (Intercept) 0.07896590 (Intr)
## bean_type   0.06872798 0.647 
## Residual    0.47109111       
## 
## Fixed effects: rating ~ bean_type 
##                 Value  Std.Error   DF   t-value p-value
## (Intercept)  3.201326 0.02117919 1734 151.15428  0.0000
## bean_type   -0.060216 0.04169026 1734  -1.44436  0.1488
##  Correlation: 
##           (Intr)
## bean_type -0.059
## 
## Standardized Within-Group Residuals:
##        Min         Q1        Med         Q3        Max 
## -4.5764691 -0.6403946  0.1001228  0.7177037  3.6529112 
## 
## Number of Observations: 1795
## Number of Groups: 60

Model Selection

In the model selection, the second model is the better fit but not by much.

AIC(cpooling, m1_lme, m2_lme)

##          df      AIC
## cpooling  3 2445.332
## m1_lme    4 2427.166
## m2_lme    6 2429.655

Adding Company Location covariate

Adding the company location tothe random effect component.

m3_lme <- lme(rating ~ bean_type + cocoa_perc, data = chocolate, random = ~ bean_type|company_location, method = "ML", control = lmeControl(opt = 'optim'))
summary(m3_lme)

## Linear mixed-effects model fit by maximum likelihood
##  Data: chocolate 
##        AIC      BIC    logLik
##   2383.402 2421.852 -1184.701
## 
## Random effects:
##  Formula: ~bean_type | company_location
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev     Corr  
## (Intercept) 0.07889094 (Intr)
## bean_type   0.08328158 0.288 
## Residual    0.46463319       
## 
## Fixed effects: rating ~ bean_type + cocoa_perc 
##                 Value  Std.Error   DF  t-value p-value
## (Intercept)  4.082871 0.12758403 1733 32.00143  0.0000
## bean_type   -0.050954 0.04401439 1733 -1.15766  0.2472
## cocoa_perc  -0.012313 0.00175902 1733 -6.99983  0.0000
##  Correlation: 
##            (Intr) bn_typ
## bean_type   0.010       
## cocoa_perc -0.986 -0.035
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -4.67797829 -0.65855819  0.09671923  0.69061773  3.66075574 
## 
## Number of Observations: 1795
## Number of Groups: 60

Cross-level Interaction

Here I run a cross level interaction with bean type and the cocoa percentage. It shows that the new variable shows no significance because the p-value is greater tha -1.9. The forastero bean on average has a higher rating than the non-forastero bean with 1.02 rating. With a one percentage increase of ocoa percentage, reduces the forastero difference by .01 points.

m4_lme <- lme(rating ~ bean_type*cocoa_perc, data = chocolate, random = ~ bean_type|company_location, method = "ML", control = lmeControl(opt = 'optim'))
summary(m4_lme)

## Linear mixed-effects model fit by maximum likelihood
##  Data: chocolate 
##        AIC      BIC    logLik
##   2375.267 2419.209 -1179.633
## 
## Random effects:
##  Formula: ~bean_type | company_location
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev     Corr  
## (Intercept) 0.07955581 (Intr)
## bean_type   0.11986931 -0.22 
## Residual    0.46305462       
## 
## Fixed effects: rating ~ bean_type * cocoa_perc 
##                          Value Std.Error   DF   t-value p-value
## (Intercept)           3.895861 0.1400503 1732 27.817589  0.0000
## bean_type             1.023106 0.3327025 1732  3.075139  0.0021
## cocoa_perc           -0.009688 0.0019369 1732 -5.001840  0.0000
## bean_type:cocoa_perc -0.014714 0.0045383 1732 -3.242165  0.0012
##  Correlation: 
##                      (Intr) bn_typ cc_prc
## bean_type            -0.415              
## cocoa_perc           -0.988  0.414       
## bean_type:cocoa_perc  0.419 -0.989 -0.424
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -4.67205654 -0.67399764  0.07629979  0.70376802  3.66912522 
## 
## Number of Observations: 1795
## Number of Groups: 60

Best Model

After running the AIC, I can see that our last model is the best model with the lowest AIC value.

AIC(cpooling, m1_lme, m2_lme, m3_lme, m4_lme)

##          df      AIC
## cpooling  3 2445.332
## m1_lme    4 2427.166
## m2_lme    6 2429.655
## m3_lme    7 2383.402
## m4_lme    8 2375.267

Centering Covariates

chocolate %<>% mutate(ccocoa_perc = cocoa_perc - mean(cocoa_perc))

Refiting the Model

In the refiting model, the forastero bean difference is 1.02. If the chocolate bar has 50% cocoa, the estimated forastero bean difference is -.03.

The first model shows significance between bean type and rating, with the average rating of 3.9. With each level of increase of cocoa percentage, there is a decrease in rating of 0.01. The opposite goes for the bean type of forastero, the bean type difference shows that those that use the forastero bean has on average a 0.01 points less than those with non-forastero beans.

The second model shows significance in the cocoa percentage as well as the interaction term of bean type and cocoa percentage.

m4a_lme <- lme(rating ~ bean_type*ccocoa_perc, data = chocolate, random = ~ bean_type|company_location, method = "ML", control = lmeControl(opt = 'optim'))
summary(m4a_lme)

Linear mixed-effects model fit by maximum likelihood Data: chocolate AIC BIC logLik 2375.267 2419.209 -1179.633

Random effects: Formula: ~bean_type | company_location Structure: General positive-definite, Log-Cholesky parametrization StdDev Corr
(Intercept) 0.07955581 (Intr) bean_type 0.11986931 -0.22 Residual 0.46305462

Fixed effects: rating ~ bean_type * ccocoa_perc Value Std.Error DF t-value p-value (Intercept) 3.201224 0.02161506 1732 148.10155 0.0000 bean_type -0.031883 0.04907925 1732 -0.64962 0.5160 ccocoa_perc -0.009688 0.00193690 1732 -5.00184 0.0000 bean_type:ccocoa_perc -0.014714 0.00453829 1732 -3.24216 0.0012 Correlation: (Intr) bn_typ ccc_pr bean_type -0.294
ccocoa_perc 0.023 -0.005
bean_type:ccocoa_perc -0.014 -0.075 -0.424

Standardized Within-Group Residuals: Min Q1 Med Q3 Max -4.67205654 -0.67399764 0.07629979 0.70376802 3.66912522

Number of Observations: 1795 Number of Groups: 60

htmlreg(list(m4_lme, m4a_lme), doctype = FALSE)

Statistical models
	Model 1	Model 2
(Intercept)	3.90^***	3.20^***
	(0.14)	(0.02)
bean_type	1.02^**	-0.03
	(0.33)	(0.05)
cocoa_perc	-0.01^***
	(0.00)
bean_type:cocoa_perc	-0.01^**
	(0.00)
ccocoa_perc		-0.01^***
		(0.00)
bean_type:ccocoa_perc		-0.01^**
		(0.00)
AIC	2375.27	2375.27
BIC	2419.21	2419.21
Log Likelihood	-1179.63	-1179.63
Num. obs.	1795	1795
Num. groups	60	60
p < 0.001, p < 0.01, p < 0.05

Intra-Company correlation

m0_lme <- lme(rating ~ 1, random = ~ 1|company_location, data = chocolate, method = "ML", control = lmeControl(opt = 'optim'))
summary(m0_lme)

## Linear mixed-effects model fit by maximum likelihood
##  Data: chocolate 
##        AIC      BIC    logLik
##   2428.442 2444.921 -1211.221
## 
## Random effects:
##  Formula: ~1 | company_location
##         (Intercept)  Residual
## StdDev:  0.09025008 0.4717564
## 
## Fixed effects: rating ~ 1 
##                Value  Std.Error   DF  t-value p-value
## (Intercept) 3.190693 0.02222047 1735 143.5925       0
## 
## Standardized Within-Group Residuals:
##        Min         Q1        Med         Q3        Max 
## -4.5751592 -0.6574433  0.1189825  0.7304919  3.6360974 
## 
## Number of Observations: 1795
## Number of Groups: 60

intervals(m0_lme)

## Approximate 95% confidence intervals
## 
##  Fixed effects:
##                lower     est.    upper
## (Intercept) 3.147124 3.190693 3.234263
## attr(,"label")
## [1] "Fixed effects:"
## 
##  Random Effects:
##   Level: company_location 
##                      lower       est.     upper
## sd((Intercept)) 0.05557016 0.09025008 0.1465728
## 
##  Within-group standard error:
##     lower      est.     upper 
## 0.4564343 0.4717564 0.4875929

Homework 8

Jessica Yin

April 6, 2019