In this week’s assignment I will use nutrition data of cereals from different cereal manufacturers to look at cereal ratings and whether the amounts of sugars and number of calories affect cereal ratings.
library(tidyverse)
library(tidyr)
library(readr)
library(dplyr)
library(visreg)
library(nlme)
library(sjmisc)
library(ggplot2)
library(radiant.data)
library(pander)
library(magrittr)
library(haven)
library(lmerTest)
library(texreg)
Cereal<- read_csv("C:/Users/Papa/Desktop/Soc 712 -R/cereal.csv", col_names = TRUE)
head(Cereal)
Cereal2 <- select(Cereal, name, mfr, calories, sugars, rating)
head(Cereal2)
For the analysis the following variables were selected cereal name, manufacturer, calories, sugars, and rating.
In the manufacturer variable A stands for American Home Food Products, G for General Mills, K for Kellogs, N for Nabisco,P for Post, Q for Quaker Oats and R for Ralston Purina. We convert this variable into a factor.
Cereal3<-Cereal2 %>% mutate (n_manufacturer = factor(ifelse(mfr == "A", 1,
ifelse(mfr == "G", 2,
ifelse(mfr == "K", 3,
ifelse(mfr == "N", 4,
ifelse(mfr == "P", 5,
ifelse(mfr == "Q", 6,
ifelse(mfr == "R", 7, "error")))))))))
head(Cereal3)
length(unique(Cereal3$n_manufacturer))
[1] 7
Cereal3 %>%
group_by(n_manufacturer) %>%
summarise(n_man = n())
manufacturers <- Cereal3 %>%
group_by(n_manufacturer) %>%
summarise(mean_r = mean(rating, na.rm = TRUE), mean_c = mean(calories, na.rm = TRUE), mean_s = mean(sugars, na.rm = TRUE))
head(manufacturers)
We generated the average rating, calories and sugars amounts per manufacturer.
cpooling <- lm(rating ~ calories, data = Cereal3)
summary(cpooling)
Call:
lm(formula = rating ~ calories, data = Cereal3)
Residuals:
Min 1Q Median 3Q Max
-18.7201 -7.9317 -0.6678 5.9902 23.4161
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 95.78802 6.55057 14.623 < 0.0000000000000002 ***
calories -0.49701 0.06031 -8.241 0.00000000000414 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 10.24 on 75 degrees of freedom
Multiple R-squared: 0.4752, Adjusted R-squared: 0.4682
F-statistic: 67.92 on 1 and 75 DF, p-value: 0.00000000000414
The output for the first complete pooling model shows that the average cereal rating (without accounting for differences among cereal manufacturers) is 95.79 if a cereal had 0 calories and that for each additional calorie the cereal rating decreases on average by 0.5. We see that the results are statistically significant at a 99% confidence level. To get a better look at rating differences we add for the next complete pooling model the amount of sugars into the model.
cpooling2 <- lm(rating ~ calories + sugars, data = Cereal3)
summary(cpooling2)
Call:
lm(formula = rating ~ calories + sugars, data = Cereal3)
Residuals:
Min 1Q Median 3Q Max
-15.643 -6.339 -1.221 4.823 23.413
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 84.11417 5.44513 15.448 < 0.0000000000000002 ***
calories -0.27644 0.05755 -4.804 0.00000793294 ***
sugars -1.71939 0.25225 -6.816 0.00000000216 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 8.083 on 74 degrees of freedom
Multiple R-squared: 0.6776, Adjusted R-squared: 0.6689
F-statistic: 77.78 on 2 and 74 DF, p-value: < 0.00000000000000022
The output for the second complete pooling model shows that the average cereal rating (without accounting for differences among cereal manufacturers) is 84.11 if a cereal had 0 calories and 0 grams sugar and that for each additional calorie the cereal rating decreases on average by 0.28 and for each additional sugar gram the cereal rating decreases on average by 1.72. Again, we see that the results are statistically significant at a 99% confidence level. Finally, we will look at rating differences with an interaction between calories and sugar amounts for the complete pooling model
cpooling3 <- lm(rating ~ calories*sugars, data = Cereal3)
summary(cpooling3)
Call:
lm(formula = rating ~ calories * sugars, data = Cereal3)
Residuals:
Min 1Q Median 3Q Max
-18.1761 -5.3793 0.1491 4.8486 15.7521
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 101.659340 7.560213 13.447 < 0.0000000000000002 ***
calories -0.454542 0.078219 -5.811 0.000000151 ***
sugars -5.038979 1.075503 -4.685 0.000012641 ***
calories:sugars 0.031056 0.009812 3.165 0.00226 **
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 7.631 on 73 degrees of freedom
Multiple R-squared: 0.7165, Adjusted R-squared: 0.7049
F-statistic: 61.51 on 3 and 73 DF, p-value: < 0.00000000000000022
In the third complete pooling model we see that on their own calories and sugar amounts are still statistically significant at a 99% confidence level, however the interaction between the two indepedent variables is only statistically significant at a 95% confidence level.
htmlreg(list(cpooling,cpooling2, cpooling3))
| Model 1 | Model 2 | Model 3 | ||
|---|---|---|---|---|
| (Intercept) | 95.79*** | 84.11*** | 101.66*** | |
| (6.55) | (5.45) | (7.56) | ||
| calories | -0.50*** | -0.28*** | -0.45*** | |
| (0.06) | (0.06) | (0.08) | ||
| sugars | -1.72*** | -5.04*** | ||
| (0.25) | (1.08) | |||
| calories:sugars | 0.03** | |||
| (0.01) | ||||
| R2 | 0.48 | 0.68 | 0.72 | |
| Adj. R2 | 0.47 | 0.67 | 0.70 | |
| Num. obs. | 77 | 77 | 77 | |
| RMSE | 10.24 | 8.08 | 7.63 | |
| p < 0.001, p < 0.01, p < 0.05 | ||||
When we compare the three complete pooling models we see that the third model is the best fitting model with an R2 of 0.72 (which is the highest amongst the three models).
dcoef <- Cereal3 %>%
group_by(n_manufacturer) %>%
do(mod = lm(rating ~ calories + sugars, data = .))
coef <- dcoef %>% do(data.frame(intc = coef(.$mod)[1]))
ggplot(coef, aes(x = intc)) + geom_histogram()
Now we use the No pooling model to look at manufacturer differences in ratings. We see that large differences in cereal ratings among different manufacturers. The distribution looks somewhat normally distributed.
dcoef <- Cereal3 %>%
group_by(n_manufacturer) %>%
do(mod = lm(rating ~ calories, data = .))
coef <- dcoef %>% do(data.frame(calories = coef(.$mod)[2]))
ggplot(coef, aes(x = calories)) + geom_histogram()
When looking at rating differences with respect to calories we now see a slightly right skewed distribution as the number of calories increase.
m1_crating <- lme(rating ~ calories, data = Cereal3, random = ~1|n_manufacturer, method = "ML")
summary(m1_crating)
Linear mixed-effects model fit by maximum likelihood
Data: Cereal3
AIC BIC logLik
568.5035 577.8787 -280.2517
Random effects:
Formula: ~1 | n_manufacturer
(Intercept) Residual
StdDev: 6.374507 8.503316
Fixed effects: rating ~ calories
Value Std.Error DF t-value p-value
(Intercept) 91.86110 6.322652 69 14.528886 0
calories -0.44305 0.054472 69 -8.133517 0
Correlation:
(Intr)
calories -0.899
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-1.76923784 -0.74835203 0.05835578 0.72828773 2.78544717
Number of Observations: 77
Number of Groups: 7
m2_crating <- lme(rating ~ sugars, data = Cereal3, random = ~1|n_manufacturer, method = "ML")
summary(m2_crating)
Linear mixed-effects model fit by maximum likelihood
Data: Cereal3
AIC BIC logLik
550.0402 559.4154 -271.0201
Random effects:
Formula: ~1 | n_manufacturer
(Intercept) Residual
StdDev: 5.674263 7.540553
Fixed effects: rating ~ sugars
Value Std.Error DF t-value p-value
(Intercept) 58.96186 2.7849719 69 21.17144 0
sugars -2.17261 0.2126363 69 -10.21750 0
Correlation:
(Intr)
sugars -0.469
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-1.76232047 -0.66734610 0.08421924 0.50447067 4.42116697
Number of Observations: 77
Number of Groups: 7
m3_crating <- lme(rating ~ calories + sugars, data = Cereal3, random = ~1|n_manufacturer, method = "ML")
summary(m3_crating)
Linear mixed-effects model fit by maximum likelihood
Data: Cereal3
AIC BIC logLik
527.6888 539.4078 -258.8444
Random effects:
Formula: ~1 | n_manufacturer
(Intercept) Residual
StdDev: 5.195241 6.40292
Fixed effects: rating ~ calories + sugars
Value Std.Error DF t-value p-value
(Intercept) 81.93156 5.038979 68 16.259557 0
calories -0.25440 0.048437 68 -5.252119 0
sugars -1.58308 0.213238 68 -7.424006 0
Correlation:
(Intr) calors
calories -0.868
sugars 0.262 -0.520
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-1.85332392 -0.62335188 -0.06762691 0.59790039 3.57112139
Number of Observations: 77
Number of Groups: 7
m4_crating <- lme(rating ~ calories*sugars, data = Cereal3, random = ~ sugars|n_manufacturer, method = "ML")
summary(m4_crating)
Linear mixed-effects model fit by maximum likelihood
Data: Cereal3
AIC BIC logLik
518.1851 536.9356 -251.0926
Random effects:
Formula: ~sugars | n_manufacturer
Structure: General positive-definite, Log-Cholesky parametrization
StdDev Corr
(Intercept) 5.44074510787 (Intr)
sugars 0.00005841409 0
Residual 5.72159189529
Fixed effects: rating ~ calories * sugars
Value Std.Error DF t-value p-value
(Intercept) 100.89399 6.624025 67 15.231522 0.0000
calories -0.44511 0.064296 67 -6.922701 0.0000
sugars -5.05465 0.877583 67 -5.759745 0.0000
calories:sugars 0.03222 0.007927 67 4.064667 0.0001
Correlation:
(Intr) calors sugars
calories -0.924
sugars -0.651 0.639
calories:sugars 0.708 -0.734 -0.976
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-2.0230039 -0.7014219 0.1141536 0.6372830 2.3697122
Number of Observations: 77
Number of Groups: 7
htmlreg(list(m1_crating, m2_crating, m3_crating, m4_crating))
| Model 1 | Model 2 | Model 3 | Model 4 | ||
|---|---|---|---|---|---|
| (Intercept) | 91.86*** | 58.96*** | 81.93*** | 100.89*** | |
| (6.32) | (2.78) | (5.04) | (6.62) | ||
| calories | -0.44*** | -0.25*** | -0.45*** | ||
| (0.05) | (0.05) | (0.06) | |||
| sugars | -2.17*** | -1.58*** | -5.05*** | ||
| (0.21) | (0.21) | (0.88) | |||
| calories:sugars | 0.03*** | ||||
| (0.01) | |||||
| AIC | 568.50 | 550.04 | 527.69 | 518.19 | |
| BIC | 577.88 | 559.42 | 539.41 | 536.94 | |
| Log Likelihood | -280.25 | -271.02 | -258.84 | -251.09 | |
| Num. obs. | 77 | 77 | 77 | 77 | |
| Num. groups | 7 | 7 | 7 | 7 | |
| p < 0.001, p < 0.01, p < 0.05 | |||||
Comparing the models, we see that the last model that accounts for manufacturer differences is the best fit model since it has the lowest AIC and BIC (with 518.19 and 536.94, respectively). Looking at the m4_crating model we see that there is a difference of 5.44 in cereal ratings when keeping the amoung of sugars constant at the higher-level (manufacturer level). We also see that there is a 5.72 rating difference after keeping cereal manufacturer and sugar amount constant.