Protein taste and Covid-19

Abstract

One sign of Covid-19 is the loss of taste and smell, also known as anosmia. To test the loss of taste reviews of protein powder were scraped from Amazon.com and parsed for the proportion of tasteless or no taste comments before and during 2020. A significance level of 0.01 was chosen and a sample size of 3465 reviews were collected. Through hypothesis testing a pvalue of 1.12e-5 was discovered from both a z-test and chi-squared test. Such a low p-val leads this report to conclude there was a significant change in reviews during 2020 thus the pandemic might play a role. Therefore this report rejects the null hypothesis and accepts the alternative hypothesis. Stated differently, there is strong evidence that the distribution of tastelss reviews and tasteful reviews before and during 2020 come from different distributions.

Introduction

Harvard professor Kate Petrova has analyzed sense of smell thorough reviews on scented candles. This report will expand on Professor Petrova research by look at how the sense of taste has changed since the first confirmed cased of Covid-19. Studies suggest a loss in taste or smell better predicts presence of Covid-19 than other well-known symptoms such as fever and cough.

The analysis will be done on the protein powders below

B000QSNYGI - Optimum Nutrition Gold Standard 100% Whey Protein Powder, Double Rich Chocolate, 5 Pound (Packaging May Vary)

B00BEOHFKO - Whey Protein Powder | MuscleTech Phase8 Protein Powder | Whey & Casein Protein Powder Blend | Slow Release 8-Hour Protein Shakes | Muscle Builder for Men & Women | Chocolate, 4.6 lbs (50 Servings)

B00NLR1PX0 - Vital Proteins Collagen Peptides Powder Supplement (Type I, III) for Skin Hair Nail Joint - Hydrolyzed Collagen - Non-GMO - Dairy and Gluten Free - 20g per Serving - Unflavored 10 oz Canister

B00QQA0H3S - Optimum Nutrition Gold Standard 100% Whey Protein Powder, Naturally Flavored Vanilla, 4.8 Pound (Packaging May Vary)

B01NAEHLFO - Garden of Life Sport Certified Grass Fed Clean Whey Protein Isolate, Vanilla, 22.57 Ounce

B06XX78LKR - Dymatize ISO 100 Whey Protein Powder with 25g of Hydrolyzed 100% Whey Isolate, Gluten Free, Fast Digesting, 1.6 Pound, Chocolate Peanut Butter, 25.6 Ounce (Pack of 1)

B00VZ0IoY8 - Dymatize ISO 100 Whey Protein Powder with 25g of Hydrolyzed 100% Whey Isolate, Gluten Free, Fast Digesting, Brown, Cinnamon Bun, 5 lbs

B07BL695D1 - Nutricost Whey Protein Isolate (Strawberry) 5LBS

B07L91HBFG - Nutrition Chocolate Milkshake Protein Powder, High Protein, Low Carb, Gluten Free, Soy Free, 1.6 lbs (Pack of 1)

B082TTFF87 - KOS Organic Plant Based Protein Powder, Chocolate Peanut Butter - Delicious Vegan Protein Powder - Keto Friendly, Gluten Free, Dairy Free & Soy Free - 1.3 Pounds, 15 Servings

This is an observational study as none of the items were actually purchased and no experiment was conducted.

Data collection

All of the data in this report were scraped from Amazon.com through product ID. The list above shows the mapping between product ID and protein powder. Ten protein powder were select through random searching and powders that author have sampled as well. To verify the reviews replace {ID} in the link below with one of the ids above.

link: “https://www.amazon.com/product-reviews/(ID)”

Data Preprocessing

The html was parsed for - dates - titles - reviews - starts - page number

df = data.frame()

for (i in dfiles) {
  pdf <-  read.csv(paste0(path, i))  
  
  titles = gather(pdf %>% select(starts_with('review_title')))$value
  reviews = gather(pdf %>% select(starts_with('review_text')))$value
  stars = gather(pdf %>% select(starts_with('review_star')))$value
  dates = gather(pdf %>% select(starts_with('review_date')))$value
  pages = gather(pdf %>% select(starts_with('page')))$value
  
  ndf <- data.frame(titles=titles, reviews=reviews, stars=stars, dates=dates, pages=pages) %>%  
          mutate(stars=as.numeric(str_extract(stars, '\\d')),
                 date=str_remove_all(dates, 'Reviewed in the United States on ') %>% as.Date(format = "%b %d, %Y"),
                 years=as.numeric(format(date, format='%Y')),
                 month=as.numeric(format(date, format='%m')),
                 days=as.numeric(format(date, format='%d')),
                 titles=gsub("[\r\n]", "", titles) %>% str_trim(side='both'),
                 protein=str_split(i, '.csv')[[1]][1],
                 reviews=str_to_lower(reviews) %>% gsub(pattern='[[:punct:] ]+', replace=' ') %>% gsub(pattern="[\r\n]", replace=' ') %>% str_trim(side='both'))
  
  df <- bind_rows(df, ndf)
}

Reviews required some extra cleaning by removing punctuation and stop words. An iterative process of reading reviews, then updating the stopword list, then reruning and rereading was conducted to clean up the reviews.

stop_word_list <- c(stopwords(), 'aarp', 'ab', 'azo', 'aug', 'asap', 'just', 'now', 'result', 'try', 'back', 'still', 'know', 'see', 'sure', 'didn', 'got', 'far', 'doesn', 'put', 'can', 'll', 've', 'don', 'much', 'go', 'take', 'use', 'since', 'get', 'always', 'has', 'aa', 'aas', 'will', 'get', 'also')

frequent_words = c('protein', 'product')
df <- df %>% filter(years > 2016 & years <= 2020) %>% 
             mutate(reviews=gsub(pattern='\\b[a-z]\\b{1}', replace=' ', reviews) %>%
                      gsub(pattern='[[:digit:]]+', replace='') %>%
                      removeWords(stop_word_list) %>%
                      str_trim())

Exploratory Data Analysis

With the data clean we can look at the sentiment in reviews

Sentimental Analysis

words = df %>% unnest_tokens(word, reviews) %>% count(word, name='word_count')

words %>% filter(word_count > 500 & !(word %in% frequent_words)) %>% mutate(word=reorder(word, word_count)) %>% ggplot(aes(x=word, y=word_count)) + geom_col() + coord_flip()

Common words like taste and flavor are used the most as expected, unusual words like hair and collagen were also used. After reading some of the hair reviews, the users stated how the protein helped their hair.

words %>% inner_join(get_sentiments('afinn')) %>%
          inner_join(get_sentiments('nrc')) %>%
          inner_join(get_sentiments('bing')) %>%
          acast(word ~ sentiment, value.var='word_count', fill=0) %>% 
  comparison.cloud(colors=c('red', 'blue'), max.words = 50)

## Joining, by = "word"
## Joining, by = "word"

## Joining, by = c("word", "sentiment")

We see nasty and bad in the negative set and good and perfect in the positive set.

Finally to prepare the data set for analysis we need specific words/phrases that would correspond to the powder having a poor taste or similar. Reviews that contain the tasteless_words will be labels as 1 in the notaste column else they will be labeled as 0.

tasteless_words = c('trash', 
                'disgusting', 
                'horrible', 
                'nasty', 
                'terrible', 
                'abomination', 
                'horrid', 
                'bad', 
                'tasteless', 
                'awful', 
                'flavor awful',
                'gagging',
                'flavor horrible',
                'terrible tasting ',
                'tasted too disgusting',
                'tastes disgusting',
                'taste horrible',
                'tasted horrible',
                'taste horrible',
                'tastes bad',
                'taste unbearable',
                'taste like cheap whey protein',
                'tastes nothing like',
                'taste isnt great',
                'worst after taste',
                'hate taste',
                'worst tasting protein',
                'medicinal and artificial taste',
                'no flavor',
                'not very good taste')

tasteless_words = paste(tasteless_words, sep='|', collapse ='|')[1]

df <- df %>% mutate(notaste = ifelse(str_detect(reviews, tasteless_words), 1, 0),
                    covidyear = ifelse(years <2020, 0, 1))

df %>% group_by(stars) %>% summarise(notaste=sum(notaste), n= n()) %>% mutate(taste=n-notaste) %>% select(stars, taste, notaste) %>% pivot_longer(cols = -stars) %>% ggplot(aes(x=stars, y=value, fill=name)) + geom_col() + labs(title='Reviews by stars', x='Stars', y='Count')

The rating distribution for protein powder is similar to reviews for all product. A parabolic distribution where most comments are on the ends (1,5) and less comments in the middle (2,3,4). This distribution is expected since those who really like the powder and really hate it will give a review, while those in the middle are less likely to review.

df %>% group_by(years) %>% summarise(notaste=sum(notaste), n= n()) %>% mutate(taste=n-notaste) %>% select(years, taste, notaste) %>% pivot_longer(cols = -years) %>% ggplot(aes(x=years, y=value, fill=name)) + geom_col() + labs(title='Reviews by years', x='Years', y='Count')

The number of reviews are constantly increasing from 2017 to 2020. This could be because more people are adding protein powder in their every day life. The amount of notaste comments is the highest in 2020 but the over count is also bigger.

df %>% ggplot(aes(x=protein)) + geom_bar() + coord_flip() + facet_wrap(~notaste, labeller = labeller(notaste =c("0"="Tastefull comments", "1"="Tasteless comments")))

We can see most powders are well represented with the exception of B00QQA0H3S; however that is fine since B00QQA0H3S and B000QSNYGI are the same brand but different flavors.

df  %>% filter(notaste == 1) %>% 
  arrange(date) %>%
  group_by(date) %>%
  summarise(Rating = mean(stars)) %>% ggplot(aes(x = (as.Date(date)), y = Rating)) +
  geom_vline(xintercept = as.numeric(as.Date("2020-01-20")), colour = "red", linetype = "dashed")+
  geom_smooth(method = "loess", size = 1.5, colour = "blueviolet", fill = "blueviolet") +
  geom_point(alpha = 0.5, colour = "chocolate1") +
  geom_text(x=as.Date("2020-05-30"), y=1.5, label='First confirmed\nCovid-19 case', color='black') +
  geom_text(x=as.Date("2017-05-10"), y=1.25, label='N = 570', color='black') +
  labs(x = "Date", y = "Average daily rating (1-5)", title = "Protein Powder Amazon reviews 2017-2020", subtitle = 'Reviews containing lack of taste')+
  theme(plot.title = element_text(size=16)) + 
  scale_x_date(date_labels = "%m-%Y")

## `geom_smooth()` using formula 'y ~ x'

Before the first case of Covid-19 the regression line is slightly sloping down; however, after the first confirmed case the line strangely goes up. That motion states there are more positive reviews and those reviews contain a lot of comments on how the powder is lacking taste.

df  %>% filter(notaste == 0) %>% 
  arrange(date) %>%
  group_by(date) %>%
  summarise(Rating = mean(stars)) %>% ggplot(aes(x = (as.Date(date)), y = Rating)) +
  geom_vline(xintercept = as.numeric(as.Date("2020-01-20")), colour = "red", linetype = "dashed")+
  geom_smooth(method = "loess", size = 1.5, colour = "blueviolet", fill = "blueviolet") +
  geom_point(alpha = 0.5, colour = "chocolate1") +
  geom_text(x=as.Date("2020-05-10"), y=1.5, label='First confirmed\nCovid-19 case', color='red') +
  geom_text(x=as.Date("2017-05-10"), y=1.25, label='N = 2895', color='black') +
  labs(x = "Date", y = "Average daily rating (1-5)", title = "Protein Powder Amazon reviews 2017-2020", subtitle = 'Reviews containing normal taste')+
  theme(plot.title = element_text(size=16)) + 
  scale_x_date(date_labels = "%m-%Y")

## `geom_smooth()` using formula 'y ~ x'

Similar to the last plot, the regression line has a slight negative slope and after the first confirmed case the slope increases.

The last 2 plots so a striking trend to our question, notaste comments are on a rise and tastefull comments are on a decline.

df %>%
  filter(years == 2020) %>%
  group_by(month) %>%
  add_tally() %>%
  summarise(n =n, notaste = sum(notaste)) %>%
  mutate(nsprop = notaste/n) %>%
  mutate(se = sqrt((nsprop*(1-nsprop))/n)) %>%
  summarise(n=mean(n), se=mean(se), nsprop=mean(nsprop)) %>% 
  ggplot(aes(x=as.factor(month), y = nsprop, group = month))+
  geom_bar(stat = "identity", fill = "lightseagreen")+
  geom_errorbar(aes(ymin = (nsprop-se), ymax = (nsprop+se)), width=0.2, colour = "gray30")+
  labs(x = "Month", y = "Proportion of reviews", title = "Proportion of tasteless reviews", subtitle="Year 2020")+
  theme_light()+
  theme(plot.title = element_text(size=16))

## `summarise()` has grouped output by 'month'. You can override using the `.groups` argument.

The start of the year has the highest proportion then a drop and a rise that peaks in october, then the proportion falls again.

Hypothesis testing

\[ H_o: \mu_{tasteless-before-2020} = \mu_{tasteless-during-2020} \\ Ha: \mu_{tasteless-before-2020} \ne \mu_{tasteless-during-2020}\] Type 1 Error : Rejecting the null hypothesis when H_o is actually true

Type 2 Error : Failing to reject the null hypothesis when H_a is actually true

Before running any analysis we must agree on an alpha 0.01.

Z-test

A Z-test is a type of inferential statistic used to determine if there is a significant difference between the means of two groups, which may be related in certain features. The sample is over 3000 which is significantly greater than 30 so a t-test would converge to a z-distribution. Before using the z-test certain condition need to be met

Success-failure condition

There should be at least 10 expected successes and 10 expected failures in a sample in order to use the normal distribution as an approximation.

\(n * (1-p) \ge 10; \quad n*p \ge10\).

There are 3465 samples and 0.16 expected successes. Substituting gives 554.4 which is well above the required amount to use a normal distribution.

Independence test

Another test is the assumption that the data came from a simple random sample and collected from a representative, randomly selected portion of the total population. I see no reason to believe the reviews have any meaningful influence with one another; however there is no way to prove that.

Confidence interval

With the conditions met we can create a confidence interval of the true population proportions

z = 1.96
se = sqrt(p1*(1-p1)/n1 + p1*(1-p2)/n2)
lower_conf <- pdiff - z * se
upper_conf <- pdiff + z * se

me1 = round(sqrt(p1*(1-p1)/n1), 3)
me2 = round(sqrt(p2*(1-p2)/n2), 3)

We can say with 95% confidence that a sample of 3465 before 2020 would have a difference in means between -0.0783735 and -0.0328514 with 0.008% error for before covid and 0.01% during covid.

se = sqrt(p_pool * (1- p_pool) / n1 + p_pool * (1- p_pool) / n2)$p_pool
pe = df_table %>% filter(covidyear != 'Total') %>% mutate(prop=Tasteless/Total) %>% pull(prop) %>% diff()

z = (pe/se)
pval <- (1 - pnorm(z)) * 2
print(pval)

## [1] 1.112243e-05

The pval for the two-tailed z-test is 0.000012 which is very small leading us to reject the null hypothesis and accept the alternative hypothesis.

Chi-squared tests

The Chi-square goodness of fit test checks whether your sample data is likely to be from a specific theoretical distribution. We have a set of data values, and an idea about how the data values are distributed. The test gives us a way to decide if the data values have a “good enough” fit to our \(H_a\), or if the \(H_o\) cannot be rejected.

\[ X^2 = \sum \frac{(Observed - Expected)^2}{Expected}\]

Covid Year	Tastefull Reviews	Tasteless Reviews	Total
Before Covid	1636	265	1901
During Covid	1259	305	1564
Total	2895	570	3465

df_table %>%  slice_head(n=2) %>% select('Tastefull Reviews', 'Tasteless Reviews') %>% chisq.test(correct=F)

## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 19.308, df = 1, p-value = 1.112e-05

With an \(X^2\) of 19.3 that corresponds to a p-value of 0.0000112 which is well beyond on alpha of 0.01 The \(X^2\) test further in forces our approach to accept the alternative of reject the null hypothesis

## Rows: 3,465
## Columns: 2
## $ covidyear <chr> "During Covid", "Before Covid", "Before Covid", "Before Covi…
## $ taste     <chr> "Tasteless", "Tasteless", "Tastefull", "Tastefull", "Tastefu…

Limitations and future work

This report only covered protein powder while there are a lot of other products that people consume where taste is an important factor. This research can be extended by sampling more protein powder reviews and verifying the conclusions here. The definition of tasteless can also be adjust to account for more or less reviews. Even though biggest event of 2020 was Covid-19 the results stated here only correlate to Covid-19 being the reason but that’s not explicitly proven. Further research can sample only those who are positive with Covid-19 and get their reviews.

Conclusion

According to insider.fitt.co about two in five (46%) Americans say they regularly consume protein drinks and shakes. Also according to the CDC about 30 million (10% of the population) Americans were infected by Covid-19. Thus it is likely a certain percent of Americans were positive for Covid-19 when they tried these protein powders and suffered from anosmia. As stated in the limitations this isn’t a casual result but one stating a strong correlation.