This was a mock consulting report for my statistical consulting class. The client was our professor, in the guise of a food science researcher. She came into our class as the client, and we had to ask questions about her project and try to figure out an appropriate analysis. The class was divided into groups, and each group was responsible for coming back with a proposal to analyze the client’s data. We presented the proposal to the client, who asked for clarification about certain parts. After the presentation, each group was given a week to run the analysis from its proposal and then come back to give a presentation and recommendations to the client. The report’s EDA is done in R and the analysis is done in SAS. The original data was a mess of Excel files and needed a lot of processing to turn it into a workable data frame.
RESEARCH FIELD: Marketing and Food Science
PROJECT TITLE: What makes a tasty french fry?
Kirsten from the marketing department (working in collaboration with the food science department) has asked us to analyze the French fry data. The data were collected on 5 days in each of 2 weeks (10 days in total). In each week, 2 different brands of oil were used to cook the french fries. Each oil was also aged from 1 to 5, i.e., the number of times french fries had been cooked in the oil. The samples were collected in such a way that each person tasted fries from 2 of the 4 oil brands. The taster then rated each sample on taste, temperature, color, texture, appearance and overall liking on a scale from 1 to 9. The data provided are raw and large, so we need to help her with the analysis stage of the study. We cleaned the raw data so that we could best answer her research questions.
Kirsten has 2 research questions:
Do the brand of oil and the age of the oil have any effect on the overall liking of the French fries?
Which attributes of the French fries are important, and which can be improved?
The first question implies that if brand has no significant effect, the cheapest of the 4 oil brands can be used to keep costs at a minimum. Also, if aging has no negative effect on tastiness, then the oil can be used for longer, up to the permissible limit. The purpose of the second question is to find out which features (taste, temperature, appearance, color, texture) affect the overall liking of the french fries and which of these features need improvement.
Kirsten has taken only one statistics course, in her undergraduate program, so she is a bit lost about how to answer her research questions. Her statistical questions are as follows:
Which technique(s) to use to answer the research questions
How to combine and transform the data
The dataset originally consisted of 33 variables. We have removed redundant and nonsensical variables from the analysis. Below are the variables that we will analyze in the exploratory data analysis.
In our initial exploration of the data, we noticed that Week 1 and Week 2 have different numbers of variables measured. In the first week, Allergies (Peanut, Fish, etc.) are recorded, as well as Cold/Pregnancy, Caffeine and Alcohol. These variables are not measured in Week 2. After discussion within our group, we decided that these variables would not help us address Kirsten’s two research questions in the final statistical analysis, so we removed them. We also removed several redundant variables, e.g. brand was recorded in three different ways. The variable Comments is useful as a response, but we feel that the Overall Liking variable captures almost the same information, so Comments was removed as well. Please refer to section 1.3: Variables of Interest for the variables that we used in both our EDA and statistical analysis: 12 variables measured over the 1,868 samples taken over the two weeks.
For the variable Gender, female tasters are sampled 972 times while male tasters are sampled only 896 times. Please see Table 1: Gender over 10 days. This imbalance might need to be taken into consideration. However, looking at Overall Liking conditioned on Gender, we observed that the distributions for the two groups are very similar. Please see Plot 1: Overall Liking conditioned on Gender. The means for males and females are roughly the same at 6.3, with standard deviations of 1.40 and 1.47 respectively. After further discussion within our team, we have decided that gender is not a variable of interest for the research questions you have asked us to explore. It will still be included in our statistical analysis, but we are not concerned with a gender effect for the research questions.
Age of Oil shows an imbalance in the number of samples taken over the two weeks. Please observe in Table 2 and Table 3 that Age 3 oil is undersampled, with only 72 samples taken in week 1 and 72 again in week 2. Age 1 oil has close to 700 samples taken over the course of the two weeks. We should take this into consideration when looking at the final results of our statistical analysis. We also feel that future studies that seek to understand the effect of Age of Oil on the “tastiness” of a french fry should try to balance age of oil as much as possible.
Next, see Plot 2: Overall Liking conditioned on Age of Oil. The boxplots for the first four ages of oil have a somewhat similar structure, i.e. their medians are at 6. Age 1 and Age 3 show some variation, and Age 2 and Age 4 have a few outliers. The Age 5 oil has its median at around 7. The white dot overlaid on each boxplot is the mean of Overall.Liking for that Age of Oil. In the first four boxplots the mean is greater than the median, indicating some skewness. For Age 5 the mean is below the median, but this is probably due to the outlier pulling the mean down. We suspect that age of oil has an effect on Overall.Liking, as Age 5 has the bulk of its data around a liking of 7 while the others do not. Also, observe that the white dot rises slightly as Age of Oil increases.
Another concern that we have with the data is the variable Preference. We observe in Plot 3: Preference that batch 1 is preferred over batch 2. Perhaps this indicates that tasters are biased towards the first sample regardless of how much they liked the fries from their two batches. Before any statistical analysis is done, this issue must be investigated. If we find that preference is not statistically significant, then the Preference variable will be removed from the final analysis. (Please see Appendix for analysis.) Also, there is some concern visually that the two weeks are different from each other. Please see Appendix for the “Week Effect” analysis.
The variable Day is one that we constructed so that we could condense the data into one spreadsheet and make it easier to feed into our proposed models. We are concerned that Day, i.e. the different days the experiment was conducted on, could have an effect on the data as well. We observe variation between the boxplots in Plot 4: Days conditioned on Overall Liking, and this is concerning. A day effect is unwanted, as it does not help to answer the research questions. Day 4, Day 5, Day 6 and Day 8 have higher medians than the other days. The model we use will account for this.
As mentioned in 1.3 Variables of Interest, we are concerned with how to interpret the variables Color and Temperature. Both are on a scale of 1 to 9, but 5 is our target outcome. The bulk of the data is at 5, which is ideal. To simplify interpretation, we have transformed Color from a scale of 1-9 to a scale of 1-5: since 5 is the best value and quality deteriorates again from 6 to 9, we have recoded 6 as 4, 7 as 3, 8 as 2 and 9 as 1, leaving the values 1 to 5 unchanged. The variable Temperature had the same transformation applied to it.
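The recoding described above amounts to folding the scale around its target value. A minimal sketch in Python follows; the function name `fold_rating` is hypothetical, and the report's own recoding was done in R, so this is only an illustration of the rule, not the code that was used.

```python
def fold_rating(score: int, target: int = 5) -> int:
    """Fold a 1-9 rating whose best value is `target` onto a 1-5 scale.

    Scores at or below the target are unchanged; scores above it are
    mirrored back (6 -> 4, 7 -> 3, 8 -> 2, 9 -> 1).
    """
    if not 1 <= score <= 9:
        raise ValueError(f"rating must be between 1 and 9, got {score}")
    return target - abs(score - target)


# Recode a handful of made-up Color ratings (illustrative values only).
raw_color = [5, 6, 4, 9, 1, 7]
folded_color = [fold_rating(s) for s in raw_color]
print(folded_color)
```

After folding, a higher value always means "closer to ideal", which puts Color and Temperature on the same footing as the other features.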
Observe in Plot 5: Correlation of Feature Variables that several of our five feature variables are highly positively correlated with Overall Liking (y). These variables are Texture (tx), Taste (ta) and Appearance (ap). Correlation between two variables implies that they are somehow linked, but we cannot draw any causal relationship from this plot. Our later regression and ANOVA models will let us quantify these relationships while accounting for the other variables, although they do not by themselves establish causation for the feature ratings. We would expect the 3 variables that are highly correlated with Overall Liking to show up in our models.
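For readers who want to see what a single cell of such a correlation plot measures, here is a small sketch of the Pearson correlation coefficient in plain Python. The function name `pearson_r` and the ratings are made up for illustration; the plot itself was produced in R from the real data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares around the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Made-up taste and overall-liking ratings, for illustration only.
taste = [6, 7, 5, 8, 6]
overall = [6, 7, 5, 8, 7]
print(round(pearson_r(taste, overall), 3))
```

A value near +1 means the two ratings rise together; it says nothing about which one drives the other.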
Below are the five variables that have ranks from 1-9 with 9 being the best. Observe that the fries for all the variables hover around 6, which means there is room for improvement for the fries.
The exploratory data analysis we have done in this section helps to guide our intuition in choosing models for our statistical analysis. These models will in turn help us draw sound and impactful conclusions about the french fry data. The models in our statistical analysis will include 10 variables in them, which have been reduced from the original variable set of 33. These 10 variables, we believe, will best answer your research questions in a parsimonious and interpretable way. We hope that the EDA is clear and succinct so that you, our client, can understand why we have chosen these 10 variables.
To understand the effect of oil brand and oil age on the overall liking, we fitted an ANOVA model with the overall liking as a continuous response variable. For the explanatory variables, we included Age, Brand and the interaction between them as fixed effects. Because data recorded on the same day are correlated, we included the variable Day (nested in Brand) as a random effect in the model. The results are summarized in Table 4. Age has a significant effect on the overall liking, while Brand and the Age-by-Brand interaction are not significant. Day also significantly affects the overall liking. The mean overall liking and its standard error for each age level are plotted in Plot 6. Fries made with age 5 oil are most preferred by the participants, while those made with age 1 oil are least favored. There is more variation in the evaluation of fries made with age 3 oil than with other ages, which is likely due to the smaller number of samples in this group.
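One way to write the model described above is shown below. The notation is ours and is an assumption, not taken from the SAS output: i indexes oil age, j indexes brand, k indexes day within brand, and l indexes the individual sample.

```latex
y_{ijkl} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + d_{k(j)} + \varepsilon_{ijkl},
\qquad d_{k(j)} \sim N(0, \sigma_d^2), \quad \varepsilon_{ijkl} \sim N(0, \sigma^2)
```

Here \alpha_i, \beta_j and (\alpha\beta)_{ij} are the fixed effects of Age, Brand and their interaction, and d_{k(j)} is the random effect of Day nested in Brand, which lets samples from the same day share a common shift.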
Table 4. Summary of the age and brand effects on the overall liking. The age of oil significantly affects the overall liking at the 0.05 significance level with p-value close to 0. Oil brand has no statistically significant effect on the overall liking and neither does the interaction between age and brand.
Plot 6. Summary of the mean overall liking and standard error for each oil age. Age 5 has the highest score in overall liking while age 1 has the lowest score. Age 3 has the largest variation while age 1 has the minimum variation among all 5 ages.
To identify people’s preference on the age of oil, we compared the mean overall liking of the 5 oil ages using the Tukey method. By comparing the difference in mean overall liking between each pair of oil ages to the Tukey statistic, we can determine which pairwise differences are significant. The results are summarized in Table 5, in which the oil age is given in the first column, “Age”, and the mean overall liking in the second column, “Estimate”. The oil ages are ordered by descending mean overall liking, that is, the oil of age 5 has the highest average score in overall liking. The letters in the next 2 columns indicate the grouping of the 5 oil ages. Oil ages that share no letter are significantly different in overall liking. Here, age 5 is significantly better than any other oil age.
Table 5. Summary of oil age comparison in mean overall liking by the Tukey method. The ages of oil are ordered by descending mean overall liking. Age 5 has a significantly higher score in overall liking than any other oil age. Age 4 is the second best: it is significantly better than age 1 but not significantly different from ages 2 and 3. Ages 2 and 3 are not significantly better than age 1.
Now, we check which features (temperature, color, taste, texture, appearance) of the French fries are important or can be improved. We fit a linear model for the overall liking (y) with respect to all the features. Surprisingly, all the features came out significant, and the model is a good fit. As seen in the correlation plot above, all the features are positively correlated with overall liking, and the linear model agrees. In the linear model all the features have a positive effect on overall liking, with taste having the largest effect. As you can see in the graph, Plot 5: Boxplots of Features, there is little room to improve color or temperature, since their means and medians are already close to 5 (the optimum value). But there is a lot of scope to improve taste, texture and appearance.
Model : We fit a model for overall liking using all the 5 features.
Note: We did consider interactions between the features, but they did not have a significant effect.
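Written out, a main-effects model of this kind has the form below. The coefficient symbols are our own notation and an assumption; ta, tx and ap are the abbreviations used in the correlation plot, and color and temp stand for the folded Color and Temperature scores.

```latex
y = \beta_0 + \beta_1\,\mathrm{ta} + \beta_2\,\mathrm{tx} + \beta_3\,\mathrm{ap}
      + \beta_4\,\mathrm{color} + \beta_5\,\mathrm{temp} + \varepsilon,
\qquad \varepsilon \sim N(0, \sigma^2)
```

Each coefficient is the expected change in overall liking for a one-point increase in that feature with the other features held fixed, which is the sense in which taste "has the most effect".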
4.0 - RECOMMENDATIONS
Since the brand of oil has no significant effect on the overall liking, the cheapest oil should be used to make the french fries, as that would be the most cost-efficient choice. Also, since the aging of the oil has a positive effect on the overall liking, the oil should be aged as much as possible, up to the permissible limit (here, age 5). All the features are important, but taste, texture and appearance have the most room for improvement, especially taste, since it has the most influence.
Future recommendation: If the permissible aging for the oil is more than 5 then that is an area which can be explored in a future study.
5.0 - RESOURCES
Statistical concepts :
Bar plots - http://www.statisticshowto.com/what-is-a-bar-chart/
T-test - http://www.statisticshowto.com/probability-and-statistics/t-test/
ANOVA - http://www.itl.nist.gov/div898/handbook/prc/section4/prc433.htm
Linear Model - https://onlinecourses.science.psu.edu/stat500/node/64
Correlation - https://www.r-bloggers.com/correlation-and-linear-regression/
Statistical software:
R - https://www.rstudio.com/
SAS - https://www.sas.com/en_us/home.html
The data were collected on campus, so it is not reasonable to generalise the results to the whole population. We treat overall liking (y) as continuous and normally distributed. Because the identity of participants is not recorded in this data set, we assumed in this analysis that each sample set was tasted by a different participant. The data collected in this study are not strictly independent, as 2 records were taken from each participant. However, given the relatively large sample size, we are not greatly concerned about violating the independence assumption.
Acknowledgement: We want to thank you, Kirsten, for giving us the opportunity to work with you on this project. It was a great experience for us to analyze data for a real project and to get a taste of the food science and marketing fields. We hope you are satisfied with our work.
Each participant tasted two batches of french fries cooked in two different brands of oil. From the EDA (Plot 3: Preference), it looks as though most participants may like the first sample more. If true, this might present a problem in our statistical analysis, so we used a t-test for paired data. A paired t-test compares two population means when the observations in one sample can be paired with observations in the other sample. The null hypothesis is that the two samples have the same mean. For your data, we conducted the paired t-test and found no significant difference between the two tastings. This is a good thing: there is no evidence that the order of tasting influences the rating, so we do not need to make any adjustment.
Welch Two Sample t-test
data: p1$y and p2$y
t = -1.0554, df = 1450.1, p-value = 0.2914
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.22411707 0.06731496
sample estimates:
mean of x mean of y
6.179389 6.257790
Output 1: t = 0.55539, df = 944, p-value = 0.5788
Welch Two Sample t-test
data: w1$y and w2$y
t = 0.57264, df = 1799.3, p-value = 0.567
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.09180256 0.16751684
sample estimates:
mean of x mean of y
6.313095 6.275238
The t-test does not reject the null hypothesis, i.e. there is no evidence to suggest that the two weeks differ from each other in terms of overall liking.
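To make the test statistics in this appendix less of a black box, here is a rough sketch in plain Python of how a Welch two-sample t statistic and a paired t statistic are computed. The function names and the example numbers are made up for illustration; the report's own tests were run in R. The p-value helper uses a normal approximation, which is an assumption that is only reasonable when the degrees of freedom are large, as they are in the outputs above.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev, variance


def welch_t(x, y):
    """Welch two-sample t statistic and Welch-Satterthwaite degrees of freedom."""
    vx = variance(x) / len(x)  # squared standard error of mean(x)
    vy = variance(y) / len(y)  # squared standard error of mean(y)
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


def paired_t(x, y):
    """Paired t statistic and degrees of freedom; x[i] and y[i] share a taster."""
    if len(x) != len(y):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(x, y)]
    t = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
    return t, len(diffs) - 1


def two_sided_p_normal_approx(t):
    """Two-sided p-value using the normal approximation (large df only)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


# Made-up ratings, for illustration only.
first = [6, 7, 5, 8, 6, 7]
second = [6, 6, 5, 7, 7, 6]
print(welch_t(first, second))
print(paired_t(first, second))
```

With the week-effect statistic reported above (t = 0.57264 on about 1799 degrees of freedom), the normal approximation gives a p-value close to the 0.567 that R reports.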