The dataset I used for this analysis was found on kaggle.com with 1795 observations. Variables that will be focused on will be where the cocoa bean orginated from, where the company is located, and the rating of the chocolate bar itself on a scale from 0-5. I also wanted to focus on one type of bean which is the more common one being used in the production of chocolate bars, Forastero and changed that variable into a binomial variable, 1 if it used , even partially, and 0 if it did not. I assigned a unique number to each different observation in the company loaction variable. There is a total of 60 different unique observations in that one variable.
library(nlme)
library(dplyr)
library(magrittr)
library(tidyr)
library(haven)
library(lmerTest)
library(ggplot2)
library(texreg)
chocolate<- read.csv("C:/Users/Jessica/Desktop/712/chocolate.csv")
head(chocolate)
## company REF review_date cocoa_dec cocoa_perc company_location rating
## 1 A. Morin 1876 2016 0.63 63 France 3.75
## 2 A. Morin 1676 2015 0.70 70 France 2.75
## 3 A. Morin 1676 2015 0.70 70 France 3.00
## 4 A. Morin 1680 2015 0.70 70 France 3.50
## 5 A. Morin 1704 2015 0.70 70 France 3.50
## 6 A. Morin 1315 2014 0.70 70 France 2.75
## bean_type origin
## 1 other Sao Tome
## 2 other Togo
## 3 other Togo
## 4 other Togo
## 5 other Peru
## 6 Criollo Venezuela
chocolate<-chocolate%>%
mutate(bean_type=ifelse(bean_type =="Forastero", "Forastero","other"))
chocolate<-chocolate%>%
mutate(bean_type=ifelse(bean_type == "Forastero" , 1, 0))
length(unique(chocolate$company_location))
## [1] 60
chocolate %>%
group_by(company_location) %>%
summarise(n_sch = n())
## # A tibble: 60 x 2
## company_location n_sch
## <fct> <int>
## 1 Amsterdam 4
## 2 Argentina 9
## 3 Australia 49
## 4 Austria 26
## 5 Belgium 40
## 6 Bolivia 2
## 7 Brazil 17
## 8 Canada 125
## 9 Chile 2
## 10 Colombia 23
## # ... with 50 more rows
Here I ran an ecological analysis with each individual country, their average rating of chocolate and whether it used the Forastero bean.
Here we see a negative proportion between the forastero bean and the rating score. A non-forastero bean has a higher scoring rating than the forastero bean.
usa <- chocolate %>%
group_by(company_location) %>%
summarise(mean_p = mean(rating, na.rm = TRUE), mean_s = mean(bean_type, na.rm = TRUE))
head(usa)
## # A tibble: 6 x 3
## company_location mean_p mean_s
## <fct> <dbl> <dbl>
## 1 Amsterdam 3.5 0.25
## 2 Argentina 3.31 0.444
## 3 Australia 3.36 0.0816
## 4 Austria 3.24 0.154
## 5 Belgium 3.09 0.2
## 6 Bolivia 3.25 0
bean <- lm(mean_p ~ mean_s, data = usa)
summary(bean)
##
## Call:
## lm(formula = mean_p ~ mean_s, data = usa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.64219 -0.13000 0.05544 0.18868 0.60781
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.14219 0.03857 81.477 <2e-16 ***
## mean_s -0.06955 0.13831 -0.503 0.617
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2586 on 58 degrees of freedom
## Multiple R-squared: 0.004341, Adjusted R-squared: -0.01283
## F-statistic: 0.2528 on 1 and 58 DF, p-value: 0.617
Here I ran the complete pooling model, treating each company location as the same with no difference. Because there are 60 different locations used in the variable, there will be 60 different regression models, one for each location.
The results of 3.19 and -0.17 for the intercept and bean type shows that there are many countervailing forces.
cpooling <- lm(rating ~ bean_type, data = chocolate)
summary(cpooling)
##
## Call:
## lm(formula = rating ~ bean_type, data = chocolate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.19408 -0.28234 0.05592 0.30592 1.80592
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.19408 0.01196 267.155 <2e-16 ***
## bean_type -0.07348 0.03591 -2.046 0.0409 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4776 on 1793 degrees of freedom
## Multiple R-squared: 0.00233, Adjusted R-squared: 0.001773
## F-statistic: 4.187 on 1 and 1793 DF, p-value: 0.04088
The graph shows that there is a difference of rating scores between each country. The range shown is the country specific average rating for non-forastero beans ranging from 2.5-4.
dcoef <- chocolate %>%
group_by(company_location) %>%
do(mod = lm(rating ~ bean_type, data = .))
coef <- dcoef %>% do(data.frame(intc = coef(.$mod)[1]))
ggplot(coef, aes(x = intc)) + geom_histogram()
The distribution in Forastero bean difference shows that non-forastero beans have a higher rating.
dcoef <- chocolate %>%
group_by(company_location) %>%
do(mod = lm(rating ~ bean_type, data = .))
coef <- dcoef %>% do(data.frame(sexc = coef(.$mod)[2]))
ggplot(coef, aes(x = sexc)) + geom_histogram()
m1_lme <- lme(rating ~ bean_type, data = chocolate, random = ~1|company_location, method = "ML")
summary(m1_lme)
## Linear mixed-effects model fit by maximum likelihood
## Data: chocolate
## AIC BIC logLik
## 2427.166 2449.137 -1209.583
##
## Random effects:
## Formula: ~1 | company_location
## (Intercept) Residual
## StdDev: 0.08808265 0.4714271
##
## Fixed effects: rating ~ bean_type
## Value Std.Error DF t-value p-value
## (Intercept) 3.200117 0.02252802 1734 142.05052 0.0000
## bean_type -0.066493 0.03668728 1734 -1.81243 0.0701
## Correlation:
## (Intr)
## bean_type -0.223
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -4.5518922 -0.6483808 0.1058487 0.7200120 3.6244885
##
## Number of Observations: 1795
## Number of Groups: 60
.07 estimated standard deviation of the theoretical normal distriubtion which covers the between company location change in the rating country change. .06 is the bean type slope..647 shows the between the two random components, in other words where the non-foraster bean is rated high, the bean type difference rating is also high.
m2_lme <- lme(rating ~ bean_type, data = chocolate, random = ~ bean_type|company_location, method = "ML", control = lmeControl(opt = 'optim'))
summary(m2_lme)
## Linear mixed-effects model fit by maximum likelihood
## Data: chocolate
## AIC BIC logLik
## 2429.655 2462.612 -1208.828
##
## Random effects:
## Formula: ~bean_type | company_location
## Structure: General positive-definite, Log-Cholesky parametrization
## StdDev Corr
## (Intercept) 0.07896590 (Intr)
## bean_type 0.06872798 0.647
## Residual 0.47109111
##
## Fixed effects: rating ~ bean_type
## Value Std.Error DF t-value p-value
## (Intercept) 3.201326 0.02117919 1734 151.15428 0.0000
## bean_type -0.060216 0.04169026 1734 -1.44436 0.1488
## Correlation:
## (Intr)
## bean_type -0.059
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -4.5764691 -0.6403946 0.1001228 0.7177037 3.6529112
##
## Number of Observations: 1795
## Number of Groups: 60
In the model selection, the second model is the better fit but not by much.
AIC(cpooling, m1_lme, m2_lme)
## df AIC
## cpooling 3 2445.332
## m1_lme 4 2427.166
## m2_lme 6 2429.655
Adding the company location tothe random effect component.
m3_lme <- lme(rating ~ bean_type + cocoa_perc, data = chocolate, random = ~ bean_type|company_location, method = "ML", control = lmeControl(opt = 'optim'))
summary(m3_lme)
## Linear mixed-effects model fit by maximum likelihood
## Data: chocolate
## AIC BIC logLik
## 2383.402 2421.852 -1184.701
##
## Random effects:
## Formula: ~bean_type | company_location
## Structure: General positive-definite, Log-Cholesky parametrization
## StdDev Corr
## (Intercept) 0.07889094 (Intr)
## bean_type 0.08328158 0.288
## Residual 0.46463319
##
## Fixed effects: rating ~ bean_type + cocoa_perc
## Value Std.Error DF t-value p-value
## (Intercept) 4.082871 0.12758403 1733 32.00143 0.0000
## bean_type -0.050954 0.04401439 1733 -1.15766 0.2472
## cocoa_perc -0.012313 0.00175902 1733 -6.99983 0.0000
## Correlation:
## (Intr) bn_typ
## bean_type 0.010
## cocoa_perc -0.986 -0.035
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -4.67797829 -0.65855819 0.09671923 0.69061773 3.66075574
##
## Number of Observations: 1795
## Number of Groups: 60
Here I run a cross level interaction with bean type and the cocoa percentage. It shows that the new variable shows no significance because the p-value is greater tha -1.9. The forastero bean on average has a higher rating than the non-forastero bean with 1.02 rating. With a one percentage increase of ocoa percentage, reduces the forastero difference by .01 points.
m4_lme <- lme(rating ~ bean_type*cocoa_perc, data = chocolate, random = ~ bean_type|company_location, method = "ML", control = lmeControl(opt = 'optim'))
summary(m4_lme)
## Linear mixed-effects model fit by maximum likelihood
## Data: chocolate
## AIC BIC logLik
## 2375.267 2419.209 -1179.633
##
## Random effects:
## Formula: ~bean_type | company_location
## Structure: General positive-definite, Log-Cholesky parametrization
## StdDev Corr
## (Intercept) 0.07955581 (Intr)
## bean_type 0.11986931 -0.22
## Residual 0.46305462
##
## Fixed effects: rating ~ bean_type * cocoa_perc
## Value Std.Error DF t-value p-value
## (Intercept) 3.895861 0.1400503 1732 27.817589 0.0000
## bean_type 1.023106 0.3327025 1732 3.075139 0.0021
## cocoa_perc -0.009688 0.0019369 1732 -5.001840 0.0000
## bean_type:cocoa_perc -0.014714 0.0045383 1732 -3.242165 0.0012
## Correlation:
## (Intr) bn_typ cc_prc
## bean_type -0.415
## cocoa_perc -0.988 0.414
## bean_type:cocoa_perc 0.419 -0.989 -0.424
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -4.67205654 -0.67399764 0.07629979 0.70376802 3.66912522
##
## Number of Observations: 1795
## Number of Groups: 60
After running the AIC, I can see that our last model is the best model with the lowest AIC value.
AIC(cpooling, m1_lme, m2_lme, m3_lme, m4_lme)
## df AIC
## cpooling 3 2445.332
## m1_lme 4 2427.166
## m2_lme 6 2429.655
## m3_lme 7 2383.402
## m4_lme 8 2375.267
chocolate %<>% mutate(ccocoa_perc = cocoa_perc - mean(cocoa_perc))
In the refiting model, the forastero bean difference is 1.02. If the chocolate bar has 50% cocoa, the estimated forastero bean difference is -.03.
The first model shows significance between bean type and rating, with the average rating of 3.9. With each level of increase of cocoa percentage, there is a decrease in rating of 0.01. The opposite goes for the bean type of forastero, the bean type difference shows that those that use the forastero bean has on average a 0.01 points less than those with non-forastero beans.
The second model shows significance in the cocoa percentage as well as the interaction term of bean type and cocoa percentage.
m4a_lme <- lme(rating ~ bean_type*ccocoa_perc, data = chocolate, random = ~ bean_type|company_location, method = "ML", control = lmeControl(opt = 'optim'))
summary(m4a_lme)
Linear mixed-effects model fit by maximum likelihood Data: chocolate AIC BIC logLik 2375.267 2419.209 -1179.633
Random effects: Formula: ~bean_type | company_location Structure: General positive-definite, Log-Cholesky parametrization StdDev Corr
(Intercept) 0.07955581 (Intr) bean_type 0.11986931 -0.22 Residual 0.46305462
Fixed effects: rating ~ bean_type * ccocoa_perc Value Std.Error DF t-value p-value (Intercept) 3.201224 0.02161506 1732 148.10155 0.0000 bean_type -0.031883 0.04907925 1732 -0.64962 0.5160 ccocoa_perc -0.009688 0.00193690 1732 -5.00184 0.0000 bean_type:ccocoa_perc -0.014714 0.00453829 1732 -3.24216 0.0012 Correlation: (Intr) bn_typ ccc_pr bean_type -0.294
ccocoa_perc 0.023 -0.005
bean_type:ccocoa_perc -0.014 -0.075 -0.424
Standardized Within-Group Residuals: Min Q1 Med Q3 Max -4.67205654 -0.67399764 0.07629979 0.70376802 3.66912522
Number of Observations: 1795 Number of Groups: 60
htmlreg(list(m4_lme, m4a_lme), doctype = FALSE)
| Model 1 | Model 2 | ||
|---|---|---|---|
| (Intercept) | 3.90*** | 3.20*** | |
| (0.14) | (0.02) | ||
| bean_type | 1.02** | -0.03 | |
| (0.33) | (0.05) | ||
| cocoa_perc | -0.01*** | ||
| (0.00) | |||
| bean_type:cocoa_perc | -0.01** | ||
| (0.00) | |||
| ccocoa_perc | -0.01*** | ||
| (0.00) | |||
| bean_type:ccocoa_perc | -0.01** | ||
| (0.00) | |||
| AIC | 2375.27 | 2375.27 | |
| BIC | 2419.21 | 2419.21 | |
| Log Likelihood | -1179.63 | -1179.63 | |
| Num. obs. | 1795 | 1795 | |
| Num. groups | 60 | 60 | |
| p < 0.001, p < 0.01, p < 0.05 | |||
m0_lme <- lme(rating ~ 1, random = ~ 1|company_location, data = chocolate, method = "ML", control = lmeControl(opt = 'optim'))
summary(m0_lme)
## Linear mixed-effects model fit by maximum likelihood
## Data: chocolate
## AIC BIC logLik
## 2428.442 2444.921 -1211.221
##
## Random effects:
## Formula: ~1 | company_location
## (Intercept) Residual
## StdDev: 0.09025008 0.4717564
##
## Fixed effects: rating ~ 1
## Value Std.Error DF t-value p-value
## (Intercept) 3.190693 0.02222047 1735 143.5925 0
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -4.5751592 -0.6574433 0.1189825 0.7304919 3.6360974
##
## Number of Observations: 1795
## Number of Groups: 60
intervals(m0_lme)
## Approximate 95% confidence intervals
##
## Fixed effects:
## lower est. upper
## (Intercept) 3.147124 3.190693 3.234263
## attr(,"label")
## [1] "Fixed effects:"
##
## Random Effects:
## Level: company_location
## lower est. upper
## sd((Intercept)) 0.05557016 0.09025008 0.1465728
##
## Within-group standard error:
## lower est. upper
## 0.4564343 0.4717564 0.4875929